Abstract The extraction of multi-attribute objects from the 
deep web is the bridge between the unstructured web and 
structured data. Existing approaches either induce wrappers 
from a set of human-annotated pages or leverage repeated 
structures on the page without supervision. What the for- 
mer lack in automation, the latter lack in accuracy. Thus 
accurate, automatic multi-attribute object extraction has re- 
mained an open challenge. 

Amber overcomes both limitations through mutual su- 
pervision between the repeated structure and automatically 
produced annotations. Previous approaches based on auto- 
matic annotations have suffered from low quality due to the 
inherent noise in the annotations and have attempted to com- 
pensate by exploring multiple candidate wrappers. In con- 
trast, Amber compensates for this noise by integrating re- 
peated structure analysis with annotation-based induction: 
The repeated structure limits the search space for wrapper 
induction, and conversely, annotations allow the repeated 
structure analysis to distinguish noise from relevant data. 
Both low recall and low precision in the annotations are 
mitigated to achieve almost human quality (> 98%) multi- 
attribute object extraction. 

To achieve this accuracy, Amber needs to be trained 
once for an entire domain. Amber bootstraps its training 
from a small, possibly noisy set of attribute instances and a 
few unannotated sites of the domain. 



1 Introduction 

The "web of data" has become a meme when talking about 
the future of the web. Yet most of the objects published on 
the web today are only published through HTML interfaces. 
Though structured data is increasingly available for common 
sense knowledge such as Wikipedia, transient data such as 
product offers is at best available from large on-line shops 
such as Amazon or large-scale aggregators. 

The aim to extract objects together with their attributes 
from the web is almost as old as the web. Its realisation has 
focused on exploiting two observations about multi-attribute 
objects on the web: (1) Such objects are typically presented 
as lists, tables, grids, or other repeated structures with a 
common template used for all objects. (2) Websites are de- 
signed for humans to quickly identify the objects and their 
attributes and thus use a limited visual and textual vocab- 
ulary to present objects of the same domain. For example, 
most product offers contain a prominent price and image. 

Previous approaches have focused either on highly accurate, 
but supervised extraction, where humans have to annotate a 
number of example pages for each site, or on unsupervised, 
but low accuracy extraction based on detecting repeated 
structures on any web page: Wrapper induction [12, 16, 21, 
24, 25, 30, 19] and semi-supervised approaches [3, 26] 
are of the first kind and require manually annotated exam- 
ples to generate an extraction program (wrapper). Though 
such annotations are easy to produce due to the above ob- 
servations, it is nevertheless a significant effort, as most sites 
use several types or variations of templates that each need to 
be annotated separately: Even a modern wrapper induction 
approach [19] requires more than 20 pages per site, as most 
sites require training for more than 10 different templates. 
Also, wrapper induction approaches are often focused on 
extracting a single attribute instead of complete records, as 
for example in [25, 12]. 

On the other hand, the latter, fully unsupervised, 
domain-independent approaches [10, 23, 28, 29, 33, 36] suffer 
from a lack of guidance on which parts of a web site 
contain relevant objects: They often recognise irrelevant, 
but regular parts of a page in addition to the actual objects 
and are susceptible to noise in the regular structure, such 
as injected ads. Together this leads to low accuracy even 
for the most recent approaches. This limits their applicability 
for turning an HTML site into a structured database, 
but fits well with web-scale extraction for search engines 
and similar settings, where coverage rather than recall is 
essential (see 0): From every site some objects or pages 
should be extracted, but perfect recall is not achievable at 
any rate and also not necessarily desirable. To improve pre- 
cision these approaches only consider object extraction from 
certain structures, e.g., tables Q or lists ifHll and are thus 
not applicable for general multi-attribute object extraction. 

This lack of accurate, automated multi-attribute extraction 
has led to a recent focus in data extraction approaches 
[13, 32, 14] on coupling repeated structure analysis (exploiting 
observation (1)) with automated annotations (exploiting 
observation (2), that most websites use similar notation for 
the same type of information). What makes this coupling 
challenging is that both the repeated structure of a page and 
the automatic annotations produced by typical annotators 
exhibit considerable noise. [13] and [32] address both types 
of noise, but in separation. In [32] this leads to very low 
accuracy, in [13] to the need to consider many alternative 
wrappers, which is feasible for single-attribute extraction 
but becomes very expensive for multi-attribute object 
extraction, where the space of possible wrappers is considerably 
larger. [14] addresses noise in the annotations, but relies 
on a rigid notion of separators between objects for its template 
discovery, which limits the types of noise it can address 
and results in low recall. 

To address these limitations, Amber tightly integrates 
repeated structure analysis with automated annotations, 
rather than relying on a shallow coupling. Mutual supervi- 
sion between template structure analysis and annotations al- 
lows Amber to deal with significant noise in both the an- 
notations and the regular structure without considering large 
numbers of alternative wrappers, in contrast to previous ap- 
proaches. Efficient mutual supervision is enabled by a novel 
insight based on observation (2) above: that in nearly all 
product domains there are one or more regular attributes, 
attributes that appear in almost every record and are visually 
and textually distinct. The most common example is price, 
but also the make of a car or the publisher of a book can 
serve as a regular attribute. By providing this extra bit of 
domain knowledge, Amber is able to efficiently extract 
multi-attribute objects with near perfect accuracy even in the 
presence of significant noise in annotations and regular structure. 

Guided by occurrences of such a regular attribute, Am- 
ber performs a fully automated repeated structure analysis 
on the annotated DOM to identify objects and their attributes 
based on the annotations. It separates wrong or irrelevant 
annotations from ones that are likely attributes and infers 
missing attributes from the template structure. 



Amber's analysis follows the same overall structure as the 
repeated structure analysis in unsupervised, domain-independent 
approaches: (1) data area identification where 
Amber separates areas with relevant data from noise, such 
as ads or navigation menus, (2) record segmentation where 
Amber splits data areas into individual records, and (3) at- 
tribute alignment where Amber identifies the attributes of 
each record. But unlike these approaches, the first two steps 
are based on occurrences of a regular attribute such as price: 
Only those parts of a page where such occurrences appear 
with a certain regularity are considered for data areas, elimi- 
nating most of the noise produced by previous unsupervised 
approaches, yet allowing us to confidently deal with pages 
containing multiple data areas. Within a data area, these 
occurrences are used to guide the segmentation of the records. 
Also the final step, attribute alignment, differs notably from 
the unsupervised approaches: It uses the annotations (now 
for all attribute types) to find attributes that appear with suf- 
ficient regularity on this page, compensating both for low 
recall and for low precision. 

Specifically, Amber's main contributions are: 

(1) Amber is the first multi-attribute object extraction 
system that combines very high accuracy (> 95%) with zero 
site-specific supervision. 

(2) Amber achieves this by tightly integrating repeated 
structure analysis with induction from automatic annotations: 
In contrast to previous approaches, it integrates these 
two parts to deal with noise in both the annotations and 
the regular structure, yet avoids considering multiple alternative 
wrappers by guiding the template structure analysis 
through annotations for a regular attribute type given as part 
of the domain knowledge: (a) Noise in the regular structure: 
Amber separates data areas which contain relevant objects 
from noise on the page (including other regular structures 
such as navigation lists) by clustering annotations of regular 
attribute types according to their depth and distance on 
the page (Section 3.3). Amber separates records, i.e., regular 
occurrences of relevant objects in a data area, from noise 
between records such as advertisements through a regularity 
condition on occurrences of regular attribute types in a 
data area (Section 3.4). (b) Noise in the annotations: Finally, 
Amber addresses such noise by exploiting the regularity of 
attributes in records, compensating for low recall by inventing 
new attributes with sufficient regularity in other records, 
and for low precision by dropping annotations with insufficient 
regularity (Section 3.5). We show that Amber 
can tolerate significant noise and yet attain above 98% accuracy, 
dealing with, e.g., 50 false positive locations per page 
on average (Section 6). (c) Guidance: The annotations of 
regular attributes are also exploited to guide the search for 
a suitable wrapper, allowing us to consider only a few, local 
alternatives in the record segmentation (Section 3.4), rather 
than many wrappers, as necessary in [13] (see Section 7). 
(3) To achieve such high accuracy, Amber requires a 
thin layer of domain knowledge consisting of annotators for 
the attribute types in the domain and the identification of 
a regular attribute type. In Section 4 we give a methodology 
for minimising the effort needed to create this domain 
knowledge: From a few example instances (collected in a 
gazetteer) for each attribute type and a few unannotated result 
pages of the domain, Amber can automatically bootstrap 
itself by verifying and extending the existing gazetteers. 
This exploits Amber's ability to extract some objects even 
with annotations that have very low accuracy (around 20%). 
Only for regular attribute types a reasonably accurate anno- 
tator is needed from the beginning. This is easy to provide 
in product domains where price is such an attribute type. 
In other domains, we have found producers such as book 
publishers or car makers a suitable regular attribute type for 
which accurate annotators are also easy to provide. 

(4) We evaluate Amber on the UK real-estate and used 
cars markets against a gold standard consisting of manually 
annotated pages from 150 real estate sites (281 pages) and 
100 used car sites (150 pages). Thereby, Amber is robust 
against significant noise: Increasing the error rate in the 
annotations from 20% to over 70% drops Amber's accuracy 
by only 3% (Section 6.1). (a) We evaluate Amber on 2,215 
pages from 500 real estate sites by automatically checking 
the number of extracted records (20,723 records) and related 
attributes against the expected extrapolated numbers 
(Section 6.2). (b) We compare Amber with RoadRunner [10] 
and MDR [23], demonstrating Amber's superiority 
(Section 6.3). (c) Finally, we show that Amber can 
learn a gazetteer from a seed gazetteer, containing 20% of 
a complete gazetteer, thereby improving its accuracy from 
50.5% to 92.7%. 

While inspired by earlier work on rule-driven result page 
analysis [11], this paper is the first complete description of 
Amber as a self-supervised system for extracting multi-attribute 
objects. In particular, we have redesigned the integration 
algorithm presented in Section 3 to deal with noise 
in both annotators and template structure. We have also re- 
duced the amount of domain knowledge necessary for Am- 
ber and provide a methodology for semi-supervised acqui- 
sition of that domain knowledge from a minimal set of ex- 
amples, once for an entire domain. Finally, we have signifi- 
cantly expanded the evaluation to reflect these changes, but 
also to provide deeper insight into Amber. 

1.1 Running Example 

We illustrate Amber on the result page from Rightmove, 
the biggest UK real estate aggregator. Figure 1 shows the 
typical parts of such pages: On top, (1) some featured properties 
are arranged in a horizontal block, while directly 
below, separated by an advertisement, (2) the properties 
matching the user's query are listed vertically. Finally, on the 
left-hand side, a block (3) provides some filtering options to 
refine the search result. At the bottom of Figure 1 we zoom 
into the third record, highlighting the identified attributes. 

Fig. 1: Result Page on rightmove.co.uk 

After annotating the DOM of the page, Amber analyzes 
the page in three steps: data area identification, record 
segmentation, and attribute alignment. In all these steps we 
exploit annotations provided by domain-specific annotators, in 
particular for regular attribute types, here price, to 
distinguish between relevant nodes and noise such as ads. 

For Figure 1, Amber identifies price annotations (highlighted 
in green, e.g., "£995 pcm"), most locations (purple), 
the number of bedrooms (orange) and bathrooms (yellow). 
The price on top (with the blue arrow), the "1 bedroom" 
in the third record, and the crossed out price in the second 
record are three examples of false positive annotations, 
which are corrected by Amber subsequently. 

Data area identification. First, Amber detects which parts 
of the page contain relevant data. In contrast to most other 
approaches, Amber deals with web pages displaying multiple, 
differently structured data areas. E.g., in Figure 1, Amber 
identifies two data areas, one for the horizontally repeated 
featured properties and one for the vertically repeated 
normal results (marked by red boxes). 

Where other approaches rely solely on repeated structure, 
Amber first identifies pivot nodes, i.e., nodes on the 
page that contain annotations for regular attribute types, here 
price. Second, Amber obtains the data areas as clusters 
of continuous sequences of pivot nodes which are evenly 
spaced at roughly the same DOM tree depth and distance 
from each other. For example, Amber does not mistake the 
filter list (3) for a data area, despite its large size and regular 
structure. Approaches only analyzing structural or visual 
structures may fail to discard this section. Also, any annotation 
appearing outside the found areas is discarded, such as 
the price annotation with the blue arrow atop area (1). 

Record segmentation. Second, Amber needs to segment 
the data area into "records", each representing one multi- 
attribute object. To this end, Amber cuts off noisy pivot 
nodes at the head and tail of the identified sequences and re- 
moves interspersed nodes, such as the crossed out price in 
the second record. The remaining pivot nodes segment the 
data area into fragments of uniform size, each with a highly 
regular structure, but additional shifting may be required as 
the pivot node does not necessarily appear at the beginning 
of the record. Among the possible record segmentations the 
one with highest regularity among the records is chosen. In 
our example, Amber correctly determines the records for 
the data areas (1) and (2), as illustrated by the dashed lines. 
Amber prunes the advertisement in area (2) as inter-record 
noise, since it would lower the segmentation regularity. 
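
As a rough illustration of how regularly spaced pivot nodes can drive the segmentation, the following Python sketch picks a record length by majority vote over the gaps between consecutive pivot positions and then keeps the longest chain of pivots spaced by exactly that length. It is a simplification under stated assumptions, not Amber's actual algorithm: pivot nodes are reduced to hypothetical sibling indices under the data area root, the function names are illustrative only, and neither the shifting of record boundaries nor the regularity-based choice among candidate segmentations is modelled.

```python
from collections import Counter


def record_length_by_majority(pivot_positions):
    """Return the most frequent gap between consecutive pivot positions."""
    gaps = [b - a for a, b in zip(pivot_positions, pivot_positions[1:])]
    if not gaps:
        return None  # fewer than two pivots: no gap to vote on
    return Counter(gaps).most_common(1)[0][0]


def regular_pivots(pivot_positions, length):
    """Keep the longest chain of pivots that are exactly `length` apart."""
    if not length:
        return []  # no usable record length (None or 0)
    present = set(pivot_positions)
    best = []
    for start in pivot_positions:
        chain = [start]
        while chain[-1] + length in present:
            chain.append(chain[-1] + length)
        if len(chain) > len(best):
            best = chain
    return best


# Hypothetical sibling indices of pivot-bearing children in one data area:
# 4 is an interspersed pivot (e.g. a crossed-out price), 14 is tail noise.
positions = [0, 3, 4, 6, 9, 14]
length = record_length_by_majority(positions)
kept = regular_pivots(positions, length)
```

In this toy input the interspersed pivot and the trailing pivot fall outside the regular chain and are dropped, which mirrors the pruning described above only in spirit.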

Attribute alignment. Finally, Amber aligns the found an- 
notations within the repeated structure to identify the record 
attributes. Thereby, Amber requires that each attribute oc- 
curs in sufficiently many records at corresponding positions. 
If this is the case, it is well-supported, and otherwise, the 
annotation is dropped. Conversely, a missing attribute is 
inferred if sufficiently many records feature an annotation of 
the same type at the position in question. For example, all 
location annotations in data area 2 share the same position, and 
thus need no adjustment. However, for the featured properties, 
the annotators may fail to recognize "Medhurst Way" as 
a location. Amber infers nevertheless that "Medhurst Way" 
must be a location (as shown in Figure 1), since all other 
records have a location at the corresponding position. For 
data area 2, bathroom and bedroom number are shown 
respectively at the same relative positions. However, the third 
record also states that there is a separate flat to sublet with 
one bedroom. This node is annotated as bedroom number, 
but Amber recognizes it as a false positive due to the lack of 
support from other records. 
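
The support-based reasoning of attribute alignment can be sketched in a few lines of Python. The sketch below is only an approximation under stated assumptions: each record is modelled as a hypothetical mapping from a relative position inside the record to the annotated attribute type (or `None` if the node carries no annotation), the threshold `min_support` is an invented parameter, and the function name is illustrative. Annotations whose (position, type) pair occurs in too few records are dropped, and well-supported pairs are filled in where a record has a node at that position but no annotation.

```python
from collections import Counter


def align_attributes(records, min_support=0.5):
    """Drop weakly supported annotations and infer well-supported missing ones."""
    counts = Counter(
        (pos, typ)
        for rec in records
        for pos, typ in rec.items()
        if typ is not None
    )
    supported = {key for key, c in counts.items() if c / len(records) >= min_support}
    aligned = []
    for rec in records:
        out = {}
        # Keep only annotations that enough other records share.
        for pos, typ in rec.items():
            if (pos, typ) in supported:
                out[pos] = typ
        # Infer an attribute where the node exists but was not annotated.
        for pos, typ in sorted(supported):
            if pos in rec and pos not in out:
                out[pos] = typ
        aligned.append(out)
    return aligned


# Toy records: the second misses a location, the third has a spurious
# bedroom annotation at an extra position inside its description.
records = [
    {"p1": "LOCATION", "p2": "BEDROOMS"},
    {"p1": None, "p2": "BEDROOMS"},
    {"p1": "LOCATION", "p2": "BEDROOMS", "p3": "BEDROOMS"},
    {"p1": "LOCATION", "p2": "BEDROOMS"},
]
aligned = align_attributes(records)
```

Here the missing location is inferred for the second record and the unsupported bedroom annotation in the third record is removed, analogous to the "Medhurst Way" and "1 bedroom" cases of the running example.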

To summarise, Amber addresses low recall and precision 
of annotations in the attribute alignment, as it can 
rely on an already established record segmentation to determine 
the regularity of the attributes. In addition, it compensates 
for noise in the annotations for regular attribute types 
in the record segmentation by majority voting to determine 
the length of a record and by dropping irregular annotations 
(such as the crossed out price in record 2). Amber also 
addresses noise in the regular structure on the page, such as 
advertisements between records and regular, but irrelevant 
areas on the page such as the refinement links. All this comes 
at the price of requiring some domain knowledge about the 
attributes and their instances in the domain, which can be 
easily acquired from just a few examples, as discussed in 
Section 4. 

2 Multi-Attribute Object Extraction 

2.1 Result Page Anatomy 

Amber extracts multi-attribute objects from result pages, 
i.e., pages that are returned as a response to a form query on 
a web site. The typical anatomy of a result page is a repeated 
structure of more or less complex records, often in the form of a 
simple sequence. Figure 1 shows a typical case, presenting 
a paginated sequence of records, each representing a real 
estate property to rent, with a price, a location, and the number 
of bedrooms and bathrooms. 

We call each instance of an object on the page a record, 
and we refer to a group of continuous and similarly structured 
records as a data area. Then, result pages for a schema 
$E = E_R \cup E_O$ that defines the regular and optional attribute 
types of a domain have the following characteristics: Each 
data area consists of (D1) a maximal and (D2) continuous 
sequence of records, while each record (D3) is a sequence 
of children of the data area root, and consists of (R1) a continuous 
sequence of sibling subtrees in the DOM tree. For all 
records, this sequence is of (R2) the same length, of (R3) the 
same repeating structure, and contains (R4) in most cases 
one instance of each regular attribute in $E_R$. Furthermore, 
each record may contain (R5) instances of some optional 
attributes $E_O$, such that attributes for all attribute types in 
$E_R \cup E_O$ (R6) appear at similar positions within each record, 
if they appear at all. For attributes, we note that relevant 
attributes (A1) tend to appear early within their record, with 
(A2) their textual content filling a large part of their surrounding 
text box. Also (A3) attributes for optional attribute types 
tend to be less standardized in their values, represented with 
more variations. 

Result pages come in many shapes, e.g., grids, 
like the one depicted in Figure 2 taken from the 
appalachianrealty.com real estate website, tables, or even 
simple lists. The prevalent case, however, is the sequence of 
individual records as in Figure 1. 

Many result pages on the web are regular, but many also 
contain considerable noise. In particular, an analysis must 
(N1) tolerate inter-record noise, such as advertisements between 
records, and (N2) intra-record noise, such as instances 
of attribute types such as price occurring also in product 
descriptions. It must also (N3) address pages with multiple 
data areas and distinguish them from regular, but irrelevant 
noise. 

Fig. 2: A grid result page. 

Further Examples. Consider a typical result page from 
Zoopla.co.uk (Figure 3). Here we have two distinct data 
areas where records are laid out using different templates. 
Premium (i.e., sponsored) results appear in the top data area 
(A), while regular results appear in the bottom data area (B). 
A wrapper generation system must be able to cluster the 
two kinds of records and distinguish the different data areas. 
Once the two data areas have been identified, the analysis of 
the records does not pose particular difficulties since, within 
each data area, the record structure is very regular. 

Another interesting case is the presence of highlighted 
results like in Figure 4, again taken from Rightmove.co.uk, 
where premium records (A) are differentiated from other 
results (B) within the same data area. This form of highlighting 
can easily complicate the analysis of the page and the 
generation of a suitable wrapper. 



2.2 Extraction Typing 

For extracting multi-attribute objects, we output a data struc- 
ture describing each object and its attributes, such as origin, 
departure time, and price. In addition, to automatically in- 
duce wrappers, Amber needs not only to extract this data 
but must also link the extracted data to its representation 
on the originating pages. To that end, Amber types nodes 
in the DOM for extraction (extraction typing) to describe 
(1) how objects appear on the page as records, (2) how at- 
tributes are structured within records, and (3) how records 
are grouped into data areas. In supervised wrapper induction 
systems, this typing is usually provided by humans 
"knowing" the objects and their attributes. But in fully 
unsupervised induction, the generation of the extraction 
typing is automated as well. To formalise extraction typing, we first 
define a web page and then type its nodes according to a 
suitable domain schema. 

Web pages. Following [4], we represent a web page as 
its DOM tree $P = ((U)_{U \in \mathit{unary}}, \mathit{child}, \text{next-sibl})$ where each 
$A \in (U)_{U \in \mathit{unary}}$ is a unary relation to label nodes with $A$, 
$\mathit{child}(p, c)$ holds if $p$ is the parent node of $c$, and $\text{next-sibl}(s, s')$ 
holds if $s'$ is the sibling directly following $s$. In abuse of notation, 
we refer to $P$ also as the set of DOM nodes in $P$. 
Further relations, e.g., descendant and following, are derived 
from these basic relations. We write $x \prec y$ if $x$ is a preceding 
sibling of $y$, and we write $x \preceq y$ for $x \prec y$ or $x = y$. For all 
nodes $n$ and $n'$, we define the sibling distance $n -_{\mathrm{sib}} n'$ with 

$$
n -_{\mathrm{sib}} n' =
\begin{cases}
|\{k \mid n \prec k \preceq n'\}| & \text{if } n \preceq n' \\
-|\{k \mid n' \prec k \preceq n\}| & \text{if } n' \prec n \\
\infty & \text{otherwise}
\end{cases}
$$

Finally, $\text{first-child}(p, c)$ holds if $c$ is the first child of $p$, i.e., if 
there is no other child $c'$ of $p$ with $c' \prec c$. 
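
Reading the sibling distance defined above as the signed number of sibling steps from one node to the other (and as infinite for nodes that are not siblings), it can be computed directly from sibling indices. The Python sketch below assumes a hypothetical `position` map from each node to its parent and its index among that parent's children; the map and the function name are illustrative and not part of the formalisation.

```python
import math


def sibling_distance(position, n, n_prime):
    """Signed sibling distance from n to n_prime, or infinity if not siblings.

    `position` maps a node to a pair (parent, index among the parent's children).
    """
    parent_n, i = position[n]
    parent_m, j = position[n_prime]
    if parent_n != parent_m:
        return math.inf
    # For siblings, counting the nodes k with n < k <= n_prime gives j - i;
    # the negated count for the reverse order yields the same expression.
    return j - i


# Toy DOM fragment: three list items under "ul" and one node under "div".
position = {
    "a": ("ul", 0),
    "b": ("ul", 1),
    "c": ("ul", 2),
    "x": ("div", 0),
}
```

For example, under these assumptions the distance from "a" to "c" is 2, from "c" to "a" it is -2, and between "a" and "x" it is infinite.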

Extraction Typing. Intuitively, data areas, records, and at- 
tributes are represented by (groups of) DOM nodes. An 
extraction typing formalizes this in typing the nodes ac- 
cordingly to guide the induction of a suitable wrapper for 
pages generated from the same template and relies on a do- 
main schema for providing attribute types. We distinguish 
attribute types into regular and optional, the latter indicating 
that attributes of that type typically occur only in some, but 
not all records. 

Definition 1 A domain schema $E = E_R \cup E_O$ defines disjoint 
sets $E_R$ and $E_O$ of regular and optional attribute types. 

Definition 2 Given a web page with DOM tree $P$, an extraction 
typing for domain schema $E = E_R \cup E_O$ is a relation 
$T : P \times (E \cup \{d, rs, rt\})$ where each node $n \in P$ with 

(1) $T(n, d)$ contains a data area, with 

(2) $T(n, rs)$, $n$ represents a record that spans the subtrees 
rooted at $n$ and its subsequent siblings $n'$. For all these 
subsequent siblings $n'$ we have 

(3) $T(n', rt)$, marking the tail of the record. 

(4) $T(n, p)$ holds, if $n$ contains an attribute of type $p \in E$. 

Data areas may not be nested, neither may records, but 
records must be children of a data area, and attributes must 
be descendants of a (single) record. 

Definition 3 Given an extraction typing $T$, a node $n$ is part 
of a record $r$, written $\mathit{partOf}_T(n, r)$, if the following conditions 
hold: $T(r, rs)$ holds, $n$ occurs in a subtree rooted at 
node $r'$ with $T(r', rs)$ or $T(r', rt)$, and there is no node $r''$ between 
$r$ and $r'$ with $T(r'', rs)$. A record $r$ is part of a data area 
$d$, written $\mathit{partOf}_T(r, d)$, if $r$ is a child of $d$, and transitively, 
we have $\mathit{partOf}_T(n, d)$ for $\mathit{partOf}_T(n, r)$ and $\mathit{partOf}_T(r, d)$. 
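
To make the role of the record start and record tail types concrete, the following Python sketch groups the children of a data area into records. It is an illustration under assumptions rather than part of the formal model: the typing is represented as a hypothetical dictionary from node identifiers to sets of type labels, with the strings "rs" and "rt" standing for the record start and tail types, and untyped children are treated as inter-record noise.

```python
def group_records(children, typing):
    """Group the ordered children of a data area into records.

    A child typed "rs" opens a new record; following children typed "rt"
    extend it. Children with neither type are skipped as noise.
    """
    records = []
    for child in children:
        types = typing.get(child, set())
        if "rs" in types:
            records.append([child])
        elif "rt" in types and records:
            records[-1].append(child)
    return records


# Toy data area with an advertisement between two two-node records.
children = ["c1", "c2", "ad", "c3", "c4"]
typing = {"c1": {"rs"}, "c2": {"rt"}, "c3": {"rs"}, "c4": {"rt"}}
records = group_records(children, typing)
```

Any node inside the subtree of a grouped child then belongs to the record opened by the closest preceding start node, which is the intuition behind the part-of relation of Definition 3.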

3 The AMBER Approach 

Following the result page anatomy from the preceding section, 
the extraction of multi-attribute objects involves three 
main tasks: (1) Identifying data areas with relevant objects 
among other noisy contents, such as advertisements or navigation 
menus, (2) segmenting such data areas into records, 
i.e., representations of individual objects, and (3) aligning 
attributes to objects, such that all records within the same 
data area feature a similar attribute structure. 

An attempt to exploit properties (D1-3), (R1-6), and 
(A1-3) directly leads to a circular search: Data areas are 
groups of regularly structured records, while records are 
data area fragments that exhibit structural similarities with 
all other records in the same area. Likewise, records and at- 
tributes are recognized in mutual reference to each other. 
Worse, automatically identifying attribute values is a natu- 
rally noisy process based on named entity recognition (e.g., 
for locations) or regular expressions (e.g., for postcodes or 
prices). Hence, to break these cyclic dependencies, we draw 
some basic consequences from the above characterization. 
Intuitively, these properties ensure that the instances of each 
regular attribute $p \in E_R$ constitute a cluster in each data area, 
where each instance occurs (D4) roughly at the same depth 
in the DOM tree and (D5) roughly at the same distance. 

Capitalizing on these properties, and observing that it 
is usually quite easy to identify the regular attributes $E_R$ 
for specific application domains, Amber relies on occurrences 
of those regular attributes to determine the records 
on a page: Given an annotator for a single such attribute 
$\pi \in E_R$ (called pivot attribute type), Amber fully automatically 
identifies relevant data areas and segments them 
into records. Taking advantage of the repeating record structure, 
this works well, even with fairly low quality annotators, 
as demonstrated in Section 6. For attribute alignment, 
Amber requires corresponding annotators for the other domain 
types, also working with low quality annotations. For 
the sake of simplicity, we ran Amber with a single pivot 
attribute per domain - achieving strong results on our evalu- 
ation domains (UK real estate and used car markets). How- 
ever, one can run Amber in a loop to analyze each page 
consecutively with different pivot attributes to choose the 
extraction instance which covers most attributes on the page. 

Once a pivot attribute type has been chosen, Amber 
identifies and segments data areas based on pivot nodes, i.e., 
DOM nodes containing instances of the pivot attribute: Data 
areas are DOM fragments containing a cluster of pivot nodes 
satisfying (D4) and (D5), and records are fragments of data 
areas containing pivot nodes in similar positions. Once data 
areas and records are fixed, we refine the attributes identified 
so far by aligning them across different records and adding 
references to the domain schema. With this approach, Amber 
deals with incomplete and noisy annotators (see Section [?]), 
created with little effort, but still extracts multi-attribute 
objects without significant overhead, as compared to single 
attribute extraction. 

Moreover, Amber deals successfully with the noise occurring 
on pages, i.e., it (N1) tolerates inter-record noise by 
recognizing the relevant data via annotations, (N2) tolerates 
intra-record variances by segmenting records driven by regular 
attributes, and it (N3) addresses multi-template pages by 
considering each data area separately for record segmentation. 

Algorithm 1: amber($P$, $T$, $E$) 

input : $P$ - DOM to be analyzed 
input : $E = E_R \cup E_O$ - schema for the searched results, 
        with a specifically marked pivot attribute $\pi \in E_R$ 
output : $T$ - extraction typing on $P$ 

1 annotate($P$, $E$, ann); 

2 identify($P$, $T$, $\pi$, ann, pivots); 

3 segment($P$, $T$, pivots); 

4 align($P$, $T$, $E$); 

5 learn($T$, ann); 



3.1 Algorithm Overview 

The main algorithm of Amber, shown in Algorithm 1 and 
Figure 5, takes as inputs a DOM tree $P$ and a schema 
$E = E_R \cup E_O$, with a regular attribute type $\pi \in E_R$ marked 
as pivot attribute type, to produce an extraction typing $T$. 
First, the annotations ann for the DOM $P$ are computed 
as described in Section 3.2 (Line 1). Then, the extraction 
typing $T$ is constructed in three steps, by identifying and 
adding the data areas (Line 2), then segmenting and adding 
the records (Line 3), and finally aligning and adding the attributes 
(Line 4). All three steps are discussed in Sections 3.3 
to 3.5. Each step takes as input the DOM $P$ and the incrementally 
expanded extraction typing $T$. The data area identification 
takes as further input the pivot attribute type $\pi$ (but 
not the entire schema $E$), together with the annotations ann. 
It produces, aside from the data areas in $T$, the sets pivots($d$) 
of pivot nodes supporting the found data areas $d$. The record 
segmentation requires these pivots to determine the record 
boundaries to be added to $T$, working independently from 
$E$. Only the attribute alignment needs the schema $E$ to type 
the DOM nodes accordingly. Finally, deviations between the 
extraction typing $T$ and the original annotations ann are exploited 
in improving the gazetteers (Line 5), as discussed in 
Section 4. 



3.2 Annotation Model 

During its first processing step, Amber annotates a given 
input DOM to mark instances of the attribute types occurring 
in $E = E_R \cup E_O$. We define these annotations with 
a relation $\mathit{ann} : E \times N \times U$, where $N$ is the DOM node 
set, and $U$ is the union of the domains of all attribute 
types in $E$. $\mathit{ann}(A, n, v)$ holds if $n$ is a text node containing 
a representation of a value $v$ of attribute type $A$. 
For the HTML fragment <span>Oxford,£2k</span>, we obtain, 
e.g., ann(LOCATION, $t$, "Oxford") and ann(PRICE, $t$, "2000"), 
where $t$ is the text node within the span. 

In Amber, we implement ann with GATE, relying on 
a mixture of manually crafted and automatically extracted 
gazetteers, taken from sources such as DBPedia [2], along 
with regular expressions for prices, postcodes, etc. In 
Section 6 we show that Amber easily compensates even for 
very low quality annotators, thus requiring only little effort 
in creating these annotators. 



3.3 Data Area Identification 

We overcome the mutual dependency of data area, record,
and attribute by approximating the regular record through
instances of the pivot attribute type π: For each record, we
aim to identify a single pivot node containing that record
attribute π (R4). A data area is then a cluster of pivot nodes
appearing regularly, i.e., the nodes have roughly the
same depth (D4) and a pairwise similar distance (D5).

Let N_π be a set of pivot nodes, i.e., for each n ∈ N_π there
is some v such that ann(π, n, v) holds. Then we turn
properties (D4) and (D5) into two corresponding regularity
measures for N_π: N_π is (M4) Θ_depth-depth consistent, if there
exists a k such that depth(n) = k ± Θ_depth for all n ∈ N_π, and
N_π is (M5) Θ_dist-distance consistent, if there exists a k such
that |path(n, n')| = k ± Θ_dist for all n ≠ n' ∈ N_π. Therein,
depth(n) denotes the depth of n in the DOM tree, and |path(n, n')|
denotes the length of the undirected path from n to n'.
Assuming some parametrization Θ_depth and Θ_dist, we derive our
definition of data areas from these measures:

Definition 4 A data area (for a regular attribute type π) is
a maximal subtree d in a DOM P where

(1) d contains a set of pivot nodes N_π with |N_π| ≥ 2,

(2) N_π is depth and distance consistent (M4-5),

(3) N_π is maximal (D1) and continuous (D2), and

(4) d is rooted at the least common ancestor of N_π.
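
The two regularity measures (M4) and (M5) can be sketched as follows. The `Node` class and all function names are hypothetical helpers introduced for illustration, and the sketch assumes that "k ± Θ" is read as "the interval of observed values has width at most Θ".

```python
from itertools import combinations


class Node:
    """Minimal DOM node; a stand-in for a real DOM implementation."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        if parent is not None:
            parent.children.append(self)


def ancestors(n):
    """Return n followed by all its ancestors up to the root."""
    chain = []
    while n is not None:
        chain.append(n)
        n = n.parent
    return chain


def depth(n):
    return len(ancestors(n)) - 1


def path_length(a, b):
    """Length of the undirected tree path between a and b."""
    steps_from_a = {x: i for i, x in enumerate(ancestors(a))}
    for j, x in enumerate(ancestors(b)):
        if x in steps_from_a:
            return steps_from_a[x] + j
    raise ValueError("nodes belong to different trees")


def is_consistent(pivots, theta_depth, theta_dist):
    """Check (M4) and (M5), reading 'k ± Θ' as 'interval width at most Θ'."""
    depths = [depth(n) for n in pivots]
    dists = [path_length(a, b) for a, b in combinations(pivots, 2)]
    depth_ok = max(depths) - min(depths) <= theta_depth
    dist_ok = not dists or max(dists) - min(dists) <= theta_dist
    return depth_ok and dist_ok


# Three result items with one pivot each, plus one stray pivot in a header.
root = Node("body")
ul = Node("ul", root)
spans = [Node("span", Node("li", ul)) for _ in range(3)]
stray = Node("span", Node("header", root))
```

In this toy tree the three list pivots are perfectly regular, while adding the stray header pivot is only tolerated once both thresholds are raised to 1.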
Algorithm 2: identify(P, T, π, ann, pivots)

input  : P - DOM to be analyzed
output : T - extraction typing on P with data areas only
input  : π - the pivot attribute type, π ∈ E_R
input  : ann - annotations on P
output : pivots - data areas support

 1 pivots(n) ← ∅ for all n ∈ P;
 2 CandDAs ← {({n}, [depth(n), depth(n)], [∞, 0]) | ann(π, n, v)};
 3 CandDAs.add((∅, [0, ∞], [0, ∞]));
 4 LastDA = (Nodes_LastDA, Depth_LastDA, Dist_LastDA) ← (∅, [], []);
 5 foreach (Nodes, Depth, Dist) ∈ CandDAs in document order do
 6     Depth' ← Depth_LastDA ⊎ Depth;
 7     Dist' ← Dist_LastDA ⊎ Dist ⊎ pathLengths(Nodes_LastDA, Nodes);
 8     if |Depth'| ≤ Θ_depth and |Dist'| ≤ Θ_dist then
           /* Cluster can be extended further */
 9         LastDA ← (Nodes_LastDA ∪ Nodes, Depth', Dist');
10     else
           /* Cluster cannot be extended further */
11         if |Nodes_LastDA| ≥ 2 then
12             d ← lca(Nodes_LastDA);
13             if |pivots(d)| < |Nodes_LastDA| then
14                 pivots(d) ← Nodes_LastDA; add T(d, d);
15         LastDA ← (Nodes, Depth, Dist);



Algorithm 2 shows Amber's approach to identifying
data areas accordingly. The algorithm takes as input a DOM
tree P, an annotation relation ann, and a pivot attribute type
π. As a result, the algorithm marks all data area roots n ∈ P
by adding T(n, d) to the extraction typing T. In addition, the
algorithm computes the support of each data area, i.e., the
set of pivot nodes giving rise to a data area. The algorithm
assigns this support set to pivots for use by the subsequent
record segmentation.

The algorithm clusters pivot nodes in the document,
recording for each cluster the depth and distance interval
of all nodes encountered so far. Let I = [i1, i2] and J =
[j1, j2] be two such intervals. Then we define the merge
of I and J as I ⊎ J = [min(i1, j1), max(i2, j2)]. A (candidate)
cluster is given as a tuple (Nodes, Depth, Dist) where Nodes
is the clustered pivot node set, and Depth and Dist are the
minimal intervals over ℕ₀ such that depth(n) ∈ Depth and
|path(n, n')| ∈ Dist hold for all n, n' ∈ Nodes.

During initialization, the algorithm resets the support
pivots(n) for all nodes n ∈ P (Line 1), turns all pivot
nodes into candidate data areas of size 1 (Line 2),
and adds a special candidate data area (∅, [0, ∞], [0, ∞])
(Line 3) to ensure proper termination of the algorithm's
main loop. This data area is processed after all other data
areas and hence forces the algorithm in its last iteration
into the else branch (Lines 10-15, explained below). Before
starting the main loop, the algorithm initializes LastDA =
(Nodes_LastDA, Depth_LastDA, Dist_LastDA) to hold the data area
constructed in the last iteration. This data area is initially
empty and set to (∅, [], []) (Line 4).







Fig. 6: Data area identification 



After initialization, the algorithm iterates in document
order over all candidate data areas (Nodes, Depth, Dist) in
CandDAs (Line 5). In each iteration, the algorithm tries to
merge this data area with the one constructed up until the
last iteration, i.e., with LastDA. If no further merge is
possible, the resulting data area is added as a result (if some
further property holds). To check whether a merge is
possible, the algorithm first merges the depth and distance
intervals (Lines 6 and 7, respectively). The latter is computed
by merging the intervals from the clusters with a third one,
pathLengths, the interval covering the path lengths between
pairs of nodes from the different clusters (Line 7). If the
new cluster is still Θ_depth-depth and Θ_dist-distance
consistent (Line 8), we merge the current candidate data area into
LastDA and continue (Line 9).

Otherwise, the cluster LastDA cannot be grown further.
Then, if LastDA contains at least 2 nodes (Line 11), we
compute the representative d of LastDA as the least common
ancestor lca(Nodes_LastDA) of the contained pivot nodes
Nodes_LastDA (Line 12). If this representative d is not already
bound to another (earlier occurring) support set of at least
the same size (Line 13), we assign Nodes_LastDA as new support
to pivots(d) and mark d as a data area by adding T(d, d)
(Line 14). Finally, we start building a new data area from the
current candidate (Nodes, Depth, Dist). The algorithm always
enters this else branch during its last iteration to ensure that
the very last data area's pivot nodes are properly considered
as a possible support set.
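
The clustering loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Amber's implementation: the `Node` class and helper names are hypothetical, the pivot nodes are assumed to be handed in already in document order, the interval width test uses "at most Θ", and the returned dictionary plays the role of pivots (its keys are the data area roots that would be typed with T(d, d)).

```python
import math


class Node:
    """Minimal DOM node; a stand-in for a real DOM implementation."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        if parent is not None:
            parent.children.append(self)


def ancestors(n):
    chain = []
    while n is not None:
        chain.append(n)
        n = n.parent
    return chain


def lca_dist(a, b):
    """Return (least common ancestor, undirected path length) of a and b."""
    steps_from_a = {x: i for i, x in enumerate(ancestors(a))}
    for j, x in enumerate(ancestors(b)):
        if x in steps_from_a:
            return x, steps_from_a[x] + j
    raise ValueError("nodes belong to different trees")


EMPTY = (math.inf, -math.inf)  # neutral element of the interval merge


def merge(i, j):
    return (min(i[0], j[0]), max(i[1], j[1]))


def width(i):
    return max(0, i[1] - i[0])  # the empty interval has width 0


def path_lengths(nodes_a, nodes_b):
    interval = EMPTY
    for a in nodes_a:
        for b in nodes_b:
            d = lca_dist(a, b)[1]
            interval = merge(interval, (d, d))
    return interval


def identify(pivot_nodes, theta_depth, theta_dist):
    """Cluster pivot nodes (given in document order) into data areas.

    Returns a dict mapping each data area root to its supporting pivots.
    """
    cands = []
    for n in pivot_nodes:
        d = len(ancestors(n)) - 1
        cands.append(([n], (d, d), EMPTY))
    cands.append(([], (0, math.inf), (0, math.inf)))  # sentinel candidate
    pivots = {}
    last_nodes, last_depth, last_dist = [], EMPTY, EMPTY
    for nodes, dep, dist in cands:
        new_depth = merge(last_depth, dep)
        new_dist = merge(merge(last_dist, dist), path_lengths(last_nodes, nodes))
        if width(new_depth) <= theta_depth and width(new_dist) <= theta_dist:
            # Cluster can be extended further.
            last_nodes, last_depth, last_dist = last_nodes + nodes, new_depth, new_dist
        else:
            # Cluster cannot be extended further: emit it if large enough.
            if len(last_nodes) >= 2:
                root = last_nodes[0]
                for n in last_nodes[1:]:
                    root = lca_dist(root, n)[0]
                if len(pivots.get(root, [])) < len(last_nodes):
                    pivots[root] = last_nodes
            last_nodes, last_depth, last_dist = nodes, dep, dist
    return pivots


# Two result lists under a common parent; within a list pivots are 4 apart,
# across lists they are 6 apart.
body = Node("body")
main = Node("div", body)
area1, area2 = Node("ul", main), Node("ul", main)
prices = [Node("span", Node("li", area1)) for _ in range(2)]
prices += [Node("span", Node("li", area2)) for _ in range(3)]
```

With both thresholds set to 1, `identify(prices, 1, 1)` yields two data areas rooted at the two lists; with thresholds of 3 the distance interval [4, 6] is tolerated and a single area rooted at their common parent results.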

Theorem 1 The set of data areas for a DOM P of size n
under schema E and pivot attribute type π is computed in
O(n²).

Proof Lines 1-4 iterate twice over the DOM and are
therefore in O(n). Lines 5-15 are in O(n²), as the loop is
dominated by the computation of the distance intervals. For the
distance intervals, we extend the interval by the maximum
and minimum path length between nodes from Nodes_LastDA
and Nodes and thus compare any pair of nodes at most once
(when merging it to the previous cluster). □
To illustrate Algorithm 2, consider Figure 6 with Θ_dist =
Θ_depth = 3. Yellow diamonds represent the data areas D1, D2,
and D3, and red triangles pivot nodes. With these large
thresholds, the algorithm creates one cluster at D1 with M\ \io
as support, despite the homogeneity of the subtree rooted at
E and the "loss" of the three rightmost pivot nodes in E. In
Section 6 we show that the best results are obtained with
smaller thresholds, viz. Θ_dist = 2 and Θ_depth = 1, which
indeed would split D1 in this case. Also note that D2 and D3
are not distance consistent and thus cannot be merged. Small
variations in depth and distance, however, such as in D2, do
not affect the data area identification.



3.4 Record Segmentation 

During the data area identification, Amber identifies the data
areas of a page, marks their roots d with T(d, d), and
provides the pivot nodes pivots(d) supporting each data area,
with its pivot nodes occurring roughly at the same depth
and mutual distance. As in data area identification, Amber
approximates the occurrence of relevant data and structural
record similarity through instances of regular attribute
types E_R (R4) to construct a set of candidate segmentations.
Hence, only the records in these candidate segmentations
must be checked for mutual structural similarity (R3),
allowing Amber to scale to large and complex pages with ease.

Definition 5 A record is a set r of continuous children of
a data area d (R1), such that r contains at least one pivot
node from pivots(d) (R4). A record segmentation of d is
a set of non-overlapping records ℛ of uniform size (R2).
The quality of a segmentation ℛ improves with increasing
size (D1) and decreasing irregularity (R3).

Given a data area root d and its pivot nodes pivots(d),
this leads to a dual-objective optimization problem, striving
for a maximal area of minimal irregularity. We concretize
this problem with the notion of leading nodes: Given a pivot
node n ∈ pivots(d), we call the child l of d that contains n
as a descendant the leading node l of n. Accordingly, we
define leadings(d) as the set of leading nodes of a data
area rooted at d. To measure the number of siblings of l
potentially forming a record, we compute the leading space
lspace(l, L) after a leading node l ∈ L as the sibling distance
l −_sibl l', where l' ∈ L is the next leading node in document
order. The two objectives for finding an optimal record
segmentation ℛ are then as follows:

(1) Maximize the subset ℛ' ⊆ ℛ of records that are evenly
segmented (D1). A subset ℛ' = {r_1, ..., r_k} is evenly
segmented if each record r_i ∈ ℛ' contains exactly one pivot
node n_i ∈ pivots(d) (R4), and all leading nodes l_i
corresponding to a pivot node n_i have the same leading space
lspace(l_i, leadings(d)) (R1-3).

(2) Minimize the irregularity of the record segmentation
(R3). The irregularity of a record segmentation ℛ
equals the summed relative tree edit distances between
all pairs of nodes in different records in ℛ, i.e.,
irregularity(ℛ) = Σ_{n ∈ r, n' ∈ r' with r ≠ r' ∈ ℛ} editDist(n, n'),
where editDist(n, n') is the standard tree edit distance
normalized by the size of the subtrees rooted at n and
n' (their "maximum" edit distance).

Amber approximates such a record segmentation with
Algorithm 3. It takes as input a DOM P, a data area root
d ∈ P, and accesses the corresponding support sets via
pivots(d), as constructed by the data area identification
algorithm of the preceding section. The segmentation is
computed in two steps, first searching a basic record
segmentation that contains a large sequence of evenly segmented
pivot nodes, and second, shifting the segmentation
boundaries back and forth to minimize the irregularity. In a
preprocessing step, all children of the data area without text or
attributes ("empty" nodes) are collapsed and excluded from
the further discussion, assuming that these act as separator
nodes, such as br nodes.

So, the algorithm initially determines the sequence L of
leading nodes underlying the segmentation (Line 2). Based
on these leading nodes, the algorithm estimates the distance
Len between leading nodes (Line 3) that yields the largest
evenly segmented sequence: We take for Len the shortest
leading space lspace(l) among those leading spaces occurring
most often in L. Then we deal with noise prefixes by
removing those leading nodes l_k from the beginning of L
which have lspace(l_k) smaller than Len (Lines 4-5). After
dealing with the prefixes, we drop all leading nodes from L
whose sibling distance to the previous leading node is less
than Len (Lines 6-7). This loop ensures that each remaining
leading node has a leading space of at least Len and takes
care of noise suffixes.
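
The estimation of Len and the removal of noisy leading nodes can be sketched as follows. Leading nodes are represented by their child index below the data area root, which is a simplification; the function names are hypothetical, and the sketch assumes that the leading space of the last leading node runs to the end of the data area, a case the text does not spell out.

```python
from collections import Counter


def leading_spaces(leading, n_children):
    """Sibling distance from each leading node to the next one.

    Assumption: the space after the last leading node extends to the end
    of the data area (n_children).
    """
    following = leading[1:] + [n_children]
    return [b - a for a, b in zip(leading, following)]


def estimate_len(leading, n_children):
    """Shortest leading space among those occurring most often."""
    counts = Counter(leading_spaces(leading, n_children))
    top = max(counts.values())
    return min(space for space, count in counts.items() if count == top)


def filter_leading(leading, n_children):
    """Return (Len, leading nodes that survive the noise filtering)."""
    length = estimate_len(leading, n_children)
    kept = list(leading)
    # Noise prefix: drop leading nodes at the front whose space is too short.
    while kept and leading_spaces(kept, n_children)[0] < length:
        kept.pop(0)
    # Drop every leading node that follows the previous kept one too closely.
    result = []
    for pos in kept:
        if not result or pos - result[-1] >= length:
            result.append(pos)
    return length, result


length, kept = filter_leading([0, 1, 5, 9, 13, 14], 16)
```

For leading nodes at child positions 0, 1, 5, 9, 13, 14 among 16 children, the most frequent leading space is 4, so the nodes at positions 0 and 14 are treated as noise.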

With the leading nodes L as a frame for segmenting
the records, the algorithm generates all segmentations with
record size Len such that each record contains at least one
leading node from L. To that end, the algorithm computes
all possible sets StartCandidates of record start points for
these records by shifting the original leading nodes L to
the left (Line 8). The optimal segmentation ℛ_opt is set to
the empty set, assuming that the empty set has high
irregularity (Line 9). We then iterate over all such start point
sets S (Line 10) and compute the actual segmentations ℛ
as the records of length Len, each starting from one starting
point in S (Line 11). By construction, these are records, as
they are continuous siblings and contain at least one
leading node (and hence at least one pivot node). The whole
segmentation is a record segmentation as its records are
non-overlapping (because of Lines 6-7) and of uniform size Len
(Line 12). From all constructed segmentations, we choose
the one with the lowest irregularity (Lines 13-14). Finally,
Algorithm 3: segment(P, T, pivots)

input    : P - DOM to be analyzed
input    : T - extraction typing on P
input    : pivots - data areas support
modifies : T - adds record segmentations

 1 foreach d ∈ P : T(d, d) do
 2     L ← leadings(d);
 3     Len ← min{lspace(l, L) : l ∈ L with maximal
                 |{l' ∈ L : lspace(l, L) = lspace(l', L)}|};
 4     while l_k ∈ L in document order, lspace(l_k, L) < Len do
 5         delete l_k from L;
 6     for l_k ∈ L in document order do
 7         if lspace(l_k, L) < Len then delete-skip l_{k+1} from L;
 8     StartCandidates ← {{n : ∃l ∈ L : n −_sibl l = i} : 0 ≤ i < Len};
 9     ℛ_opt ← ∅;
10     foreach S ∈ StartCandidates do
11         ℛ ← {{n : s −_sibl n < Len} : s ∈ S};
12         if ∀r ∈ ℛ : |r| = Len then
13             if irregularity(ℛ) < irregularity(ℛ_opt) then
14                 ℛ_opt ← ℛ;
15     foreach r ∈ ℛ_opt do
16         foreach node n_i ∈ r in document order do
17             add T(n_i, rt);
18         add T(n_1, rs);

we iterate through all records r in the optimal segmentation
ℛ_opt (Line 15), and mark the first node n ∈ r as record start
with T(n, rs) (Line 18) and all remaining nodes n ∈ r as
record tail with T(n, rt) (Lines 16-17).

Theorem 2 Algorithm 3 runs in O(b·n³) on a data area d
with b as degree of d and n as size of the subtree below d.

Proof Lines 2-8 are in O(b²). Line 8 generates in
StartCandidates at most b segmentations (as Len ≤ b) of at most b
size. The loop in Lines 10-14 is executed once for each
segmentation S ∈ StartCandidates and is dominated by the
computation of irregularity(), which is bounded by O(n³) using a
standard tree edit distance algorithm. Since b ≤ n, the
overall bound is O(b² + b·n³) = O(b·n³). □



In the example of Figure 7, Amber generates five
segmentations with Len = 4, because of the three (red) div
nodes occurring at distance 4. Note how the first and last
leading nodes (p elements) are eliminated (in Lines 4-7) as
they are too close to other leading nodes. Of the five
segmentations (shown at the bottom of Figure 7), the first and
the last are discarded in Line 12 as they contain records of
a length other than 4. The middle three segmentations are
proper record segmentations, and the middle one (solid line)
is selected by Amber, because it has the lowest irregularity
among those three.
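
The shifting of record boundaries can be sketched as follows. Children of the data area are again represented by their index; the function names are hypothetical, and the irregularity measure used here is a deliberately crude stand-in (counting tag mismatches at equal offsets) for the normalized tree edit distance used by Amber.

```python
def start_candidates(leading, length):
    """Shift all leading nodes 0 .. length-1 siblings to the left."""
    return [[pos - shift for pos in leading] for shift in range(length)]


def segment(leading, length, n_children, irregularity):
    """Pick the uniform segmentation with the lowest irregularity."""
    best, best_score = None, float("inf")
    for starts in start_candidates(leading, length):
        records = [list(range(s, s + length)) for s in starts]
        # Discard segmentations with records reaching outside the data area.
        if any(r[0] < 0 or r[-1] >= n_children for r in records):
            continue
        score = irregularity(records)
        if score < best_score:
            best, best_score = records, score
    return best


def make_tag_irregularity(tags):
    """Toy irregularity: tag mismatches between all pairs of records."""

    def irregularity(records):
        shapes = [tuple(tags[i] for i in r) for r in records]
        return sum(
            sum(a != b for a, b in zip(s, t))
            for k, s in enumerate(shapes)
            for t in shapes[k + 1:]
        )

    return irregularity


# Children of a data area; the "p" nodes at positions 2, 4, 6 are leading.
tags = ["h2", "div", "p", "div", "p", "div", "p", "footer"]
best = segment([2, 4, 6], 2, len(tags), make_tag_irregularity(tags))
```

In this toy data area, starting the records at the leading nodes themselves would pair the last record with a footer, so the segmentation shifted one sibling to the left is chosen.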



3.5 Attribute Alignment 

After segmenting the data area into records, Amber aligns 
the contained attributes to complete the extraction instance. 
We limit our discussion to single valued attributes, i.e., at- 
tribute types which occur at most once in each record. In 
contrast to other data extraction approaches, Amber does 
not need to refine records during attribute alignment, since 
the repeating structure of attributes is already established in 
the extraction typing. It remains to align all attributes with
sufficient cross-record support, thereby inferring missing
attributes, eliminating noisy ones, and breaking ties where an
attribute occurs more than once in a single record.

When aligning attributes, Amber must compare the po- 
sition of attribute occurrences in different records to de- 
tect repeated structures (R3) and to select those attribute in- 
stances which occur at similar relative positions within the 
records (R6). To encode the position of an attribute relative 
to a record, we use the path from the record node to the at- 
tribute: 

Definition 6 For DOM nodes r and n with descendant(r, n),
we define the characteristic tag path tag-path_r(n) as the
sequence of HTML tags occurring on the path from r to n,
including those of r and n themselves, taking only first-child and
next-sibl steps while skipping all text nodes. With the
exception of r's tag, all HTML tags are annotated by the step type.

For example, in Figure 8 the characteristic tag path
from the leftmost a to its i descendant node
is a/first-child::p/first-child::span/next-sibl::i.
Based on characteristic tag paths, Amber quantifies the
assumption that a node n is an attribute of type p ∈ E within
record r with the support supp_T(r, n, p).
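
A characteristic tag path can be computed with a short traversal. The `Node` class and the function `tag_path` are hypothetical helpers for illustration; the sketch assumes that text nodes have already been removed from the tree, so only element nodes are visited.

```python
class Node:
    """Minimal element node; text nodes are assumed to be dropped already."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        if parent is not None:
            parent.children.append(self)


def tag_path(r, n):
    """Characteristic tag path from record node r to its descendant n."""
    chain = []  # nodes on the ancestor path, from just below r down to n
    cur = n
    while cur is not r:
        if cur is None:
            raise ValueError("n is not a descendant of r")
        chain.append(cur)
        cur = cur.parent
    chain.reverse()

    steps = [r.tag]
    parent = r
    for target in chain:
        # Descend with one first-child step, then walk next-sibl steps
        # until the child on the path towards n is reached.
        step = "first-child"
        for child in parent.children:
            steps.append(f"{step}::{child.tag}")
            if child is target:
                break
            step = "next-sibl"
        parent = target
    return "/".join(steps)


a = Node("a")
p = Node("p", a)
span = Node("span", p)
i = Node("i", p)
```

For this small tree, `tag_path(a, i)` reproduces the path given in the example above.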

Definition 7 Let T be an extraction typing on DOM P with
nodes d, r, n ∈ P where n belongs to record r, and r belongs
to the data area rooted at d. Then the support supp_T(r, n, p)
for n as attribute instance of type p ∈ E is defined as
the fraction of records r' in d that contain a node n' with
tag-path_r(n) = tag-path_r'(n') and ann(p, n', v) for arbitrary v.

Consider a data area with 10 records, containing 1
price-annotated node n1 with tag path div/.../next-sibl::span
within record r1, and 3 price-annotated nodes n2, ..., n4
with tag path div/.../first-child::p within records r2, ..., r4, resp.
Then, supp_T(r1, n1, price) = 0.1 and supp_T(ri, ni, price) =
0.3 for 2 ≤ i ≤ 4.
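
The support computation can be sketched on a simplified record representation. Each record is modelled as a list of (tag path, set of annotated types) pairs; this encoding, the function name `support`, and the concrete tag paths are assumptions made for illustration only.

```python
def support(records, tag_path, attr_type):
    """Fraction of records with a node at tag_path annotated as attr_type."""
    hits = sum(
        any(path == tag_path and attr_type in types for path, types in record)
        for record in records
    )
    return hits / len(records)


# Hypothetical tag paths standing in for the abbreviated ones in the text.
SPAN_PATH = "div/first-child::p/next-sibl::span"
P_PATH = "div/first-child::p"

# Ten records: one price at SPAN_PATH, three prices at P_PATH, six without.
records = [[(SPAN_PATH, {"price"})]]
records += [[(P_PATH, {"price"})] for _ in range(3)]
records += [[] for _ in range(6)]

support_span = support(records, SPAN_PATH, "price")
support_p = support(records, P_PATH, "price")
```

The two values correspond to the supports 0.1 and 0.3 of the example above.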

With the notion of support at hand, we define our
criterion for an acceptable extraction typing T, which we use to
transform incomplete and noisy annotations into consistent
attributes: We turn annotations into attributes if the support
is strong enough, and with even stronger support, we also
infer attributes without underlying annotation.

Definition 8 An extraction typing T over schema E =
E_R ∪ E_O and DOM P is well-supported, if for all nodes n
with T(n, p), one of the following two conditions is
satisfied, setting X = R for p ∈ E_R and X = O for p ∈ E_O:
(1) supp_T(r, n, p) > Θ^infer_X, or (2) supp_T(r, n, p) > Θ^keep_X and
ann(p, n, v).

This definition introduces two pairs of thresholds, Θ^infer_R, Θ^keep_R
and Θ^infer_O, Θ^keep_O, for dealing with regular and
optional attribute types, respectively. In both cases, we require
Θ^infer_X > Θ^keep_X, as inferring an attribute without an annotation
requires more support than keeping a given one. We also
assume that Θ^infer_R > Θ^infer_O, i.e., that optional attributes are
more easily inferred, since optional attributes tend to come with more
variations (creating false negatives) (A3). Symmetrically,
we assume Θ^keep_O > Θ^keep_R, i.e., that optional attributes are
more easily dropped, since optional attributes that are not covered by the
template (R5) might occur in free-text descriptions (creating
false positives). Taken together, we obtain Θ^infer_R > Θ^infer_O >
Θ^keep_O > Θ^keep_R. See Section 6 for details on how we set these
four thresholds.

We also apply a simple pruning technique prioritizing
early occurrences of attributes (A1), as many records start
with some semi-structured attributes, followed by a free-text
description. Thus earlier occurrences are more likely to be



Algorithm 4: align(P, T, E)

input    : P - DOM to be analyzed
input    : T - extraction typing on P
input    : E = E_R ∪ E_O - schema of the searched results
modifies : T - adds attributes

1 foreach p in E do
2     select X with p ∈ E_X : Θ^infer ← Θ^infer_X; Θ^keep ← Θ^keep_X;
3     foreach n, r ∈ P with partOf_T(n, r) do
4         if supp_T(r, n, p) > Θ^infer or (ann(p, n, v) and
5            supp_T(r, n, p) > Θ^keep) then add T(n, p);
6     foreach n with p ∈ T(n) do
7         if ∃n' : T(n', p) and following(n', n) then
8             remove T(n, p);

Fig. 8: Attribute alignment

structured attributes rather than occurrences in product
descriptions. As shown in Section 6, this simple heuristic
suffices for high-accuracy attribute alignment. For clarity and
space reasons, we therefore do not discuss more
sophisticated attribute alignment techniques.

Algorithm 4 shows the full attribute alignment
algorithm and presents a direct implementation of the
well-supportedness requirement. The algorithm iterates over all
attributes in the schema p ∈ E = E_R ∪ E_O (Line 1) and
selects the thresholds Θ^infer and Θ^keep depending on whether
p is regular or optional (Line 2). Next, we iterate over all
nodes n which are part of a record r (Line 3). We assign the
attribute type p to n if the support supp_T(r, n, p) for n
having type p reaches either the inference threshold Θ^infer or
the keep threshold Θ^keep, requiring additionally an
annotation ann(p, n, v) in the latter case (Line 5). After finding all
nodes n with enough support to be typed with p, we remove
all such type assignments except for the first one (Lines 6-8).
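
The alignment step can be sketched on the same simplified representation, where each record is a list of (tag path, set of annotated types) pairs in document order. The encoding, the function names, the shortened tag paths, and the example data are illustrative assumptions; the example is only loosely modelled on Figure 8.

```python
def support(records, tag_path, attr_type):
    """Fraction of records with a node at tag_path annotated as attr_type."""
    hits = sum(
        any(path == tag_path and attr_type in types for path, types in record)
        for record in records
    )
    return hits / len(records)


def align(records, schema, thresholds):
    """Return, per record, a dict attribute type -> index of the chosen node.

    schema maps each attribute type to "R" (regular) or "O" (optional);
    thresholds maps "R"/"O" to a pair (infer threshold, keep threshold).
    """
    typing = [dict() for _ in records]
    for attr, kind in schema.items():
        infer, keep = thresholds[kind]
        for rec_idx, record in enumerate(records):
            for node_idx, (path, types) in enumerate(record):
                s = support(records, path, attr)
                if s > infer or (attr in types and s > keep):
                    # setdefault keeps only the first occurrence per record.
                    typing[rec_idx].setdefault(attr, node_idx)
    return typing


records = [
    [("a/p/span", {"price"}), ("a/p/b", set()), ("a/p/i", {"beds"})],
    [("a/p/span", set()), ("a/p/b", {"location"})],
    [("a/p/span", {"price"}), ("a/p/em", {"location"})],
    [("a/div/span", {"price"}), ("a/p/b", {"location"}), ("a/div/em", {"price"})],
]
schema = {"price": "R", "location": "R", "beds": "O"}
thresholds = {"R": (0.4, 0.0), "O": (0.4, 0.3)}
typing = align(records, schema, thresholds)
```

In this example the price of the second record and the location of the first record are inferred from cross-record support, the weakly supported beds annotation is dropped, and the second price in the last record is ignored in favour of the first.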



Theorem 3 Amber's attribute alignment (Algorithm 4)
computes a well-supported extraction instance for a page
with DOM P in O(|E|·|P|).



In Figure 8 we illustrate attribute alignment in Amber
for Θ^infer = 40% for both regular and optional attribute types
and Θ^keep_R = 0%, Θ^keep_O = 30% (price and location
regular, beds optional): The data area has four records, each
spanning two of the children of the data area (shown as
blue diamonds). Red triangles represent attributes with the
attribute type written below. Other labels are HTML tags.
A filled triangle is an attribute directly derived from an
annotation, an empty triangle one inferred by the
algorithm in Line 6. In this example, the second record has no
price annotation. However, there is a span with tag path
a/first-child::p/first-child::span and there are two
other records (the first and third) with a span with the same
tag path from their record. Therefore that span has support
> Θ^infer = 40% for price and is added as a price attribute
to the second record. Similarly, for the b element in record 1
we infer type location from the support in records 2 and 4.
Record 3 has a location annotation, but in an em. This has
only 25% support, but since location is regular that
suffices. This contrasts with the i in record 1, which is annotated
as beds and is not accepted as an attribute since optional
attributes need at least Θ^keep_O = 30% support. In record 4 the
second price annotation is ignored since it is the second in
document order (Lines 7-8).

3.6 Running Example 

Recall Figure 1 in Section 1.1, showing the web page of
rightmove.co.uk, a UK real estate aggregator, which we
use as running example: It shows a typical result page with
one data area with featured properties (1), a second area with
regular search results (2), and a menu offering some filtering
options (3).

For this web page, Figure 9 shows a simplified DOM
along with the raw annotations for the attribute types price,
bedroomNumber, and location, as provided by our
annotation engine (for simplicity, we do not consider the
bathroomNumber shown on the original web page). Aside from the
leftmost nodes in Figure 9, belonging to the filter menu,
the DOM consists of a single large subtree with annotated
data. The numbered red arrows mark noisy or missing
annotations, to be fixed by Amber: (1) This node indeed
contains a price, but outside any record: It is the average rent
over the found results, occurring at the very top of Figure 1.
(2) The location annotation in the third record is missing.
(3) The second price in this record is shown crossed out,
and is therefore noise to be ignored. (4) This bedroom
number refers to a flat to sublet within a larger property and is
therefore noise.

Data Area Identification. For identifying the data areas,
shown in Figure 10, Algorithm 2 searches for instances of
the pivot attribute type, price in this case. Amber
clusters all pivot nodes which are depth and distance
consistent for Θ_depth = Θ_dist = 1 into one data area, obtaining the
shown Areas 1 and 2. The price instance to the very left
(issue (1) named above) does not become part of a cluster,
as its distance to all other occurrences is 6, whereas the
occurrences inside the two clusters have mutual distance 4,
with 4 + Θ_dist < 6. For the same reason, the two clusters are not
merged, as the distance between one node from Area 1 and
one from Area 2 is also 6. The data area is then identified
by the least common ancestor of the supporting pivot nodes,
called the data area root.

Record Segmentation. The record segmentation in
Algorithm 3 processes each data area in isolation: For a given
area, it first determines the leading nodes corresponding to
the pivot nodes, shown as solid black nodes in Figure 10.
The leading node of a pivot node is the child of the data
area root which is on the path from the area root to the pivot
node. In case of Area 1 to the left, all children of the area root
are leading nodes, and hence, each subtree rooted at a
leading node becomes a record in its own right, producing the
segmentation shown to the left of Figure 11. The situation
within Area 2 is more complicated: Amber first determines
the record length to be 2 sibling children of the area root,
since in most cases, the leading nodes occur at a distance
of 2, as shown in Figure 10. Having fixed the record length
to 2, Amber drops the leading nodes which follow another
leading node too closely, eliminating the leading node
corresponding to the noisy price in the second record (issue (3)
from above). Once the record length and the resulting
leading nodes are fixed, Algorithm 3 shifts the record
boundaries to find the right segmentation, yielding two
alternatives, shown on the right of Figure 11. In the upper variant,
only the second and fourth record are similar, while the first and
third record deviate significantly, causing a lot of
irregularity. Hence, the lower variant is selected, as its four records
have a similar structure.

Attribute Alignment. Algorithm 4 fixes the attributes of the
records, leading to the record structure shown in the lower half
of Figure 12. This infers the missing location and removes the
noisy bedroom number (issues (2) and (4) from above). On the upper
left of Figure 12, we show how the characteristic tag path for
location is computed, resulting in a support of 2/3, as we have
2 location occurrences at the same path within 3 records,
which is enough to infer the location attribute without original
annotation with, e.g., Θ^infer = 50%. On the upper right of
Figure 12 we show how the noisy bedroom number in the third
record is eliminated: Again, the characteristic tag paths are
shown, leading to a support of 1/4, which is too low to keep
the bedroomNumber attribute with, e.g., Θ^keep = 30%. The
resulting data area and record layout is shown at the bottom
of Figure 12.



4 Building the Domain Knowledge 

In Amber we assume that the domain schema is provided
upfront by the developer of the wrapper. In particular, for
a given extraction task, the developer must specify only the
schema E = E_R ∪ E_O of regular and optional attribute types,
using the regular attribute types as strong indicators for the
Fig. 10: Data area identification on rightmove.co.uk

Fig. 11: Record segmentation on rightmove.co.uk

Fig. 12: Attribute alignment on rightmove.co.uk



presence of an E entity on a webpage. In addition, the
developer can also specify disjointness constraints p1 ∧ p2 → ⊥
for two attribute types p1, p2 ∈ E to force the domains of p1
and p2 to be disjoint.

As mentioned earlier, devising basic gazetteers and
regular expressions for core entities of a given domain
requires very little work thanks to frameworks like GATE [11]
and openly available knowledge repositories such as
DBPedia [2] and FreeBase [5]. Values that can be recognised
with regular expressions are usually known a priori, as they
correspond to common-sense entities, e.g., phone numbers
or monetary values. On the other hand, the construction of
gazetteers, i.e., sets of terms corresponding to the domains
for attribute types (see Section 2), is generally a tedious task.
While it is easy to construct an initial set of terms for an 



attribute type, building a complete gazetteer often requires 
an exhaustive analysis of a large sample of relevant web 
pages. Moreover, the domains of some attribute types are 
constantly changing, for example a gazetteer for song titles 
is outdated quite quickly. Hence, in the following, we focus 
on the automatic construction and maintenance of gazetteers 
and show how Amber's repeated-structure analysis can be 
employed for growing small initial term sets into complete 
gazetteers. 

This automation lowers the need and cost for domain 
experts in the construction of the necessary domain knowl- 
edge, since even a non-expert can produce basic gazetteers 
for a domain to be completed by our automated learning 
processes. Moreover, the efficient construction of exhaus- 
tive gazetteers is valuable for other applications outside web 
data extraction, e.g., to improve existing annotation tools or 
to publish them as linked open data for public use. 

But even if a gazetteer is curated by a human, the result- 
ing annotations might still be noisy due to errors or intrinsic 
ambiguity in the meaning of the terms. Noise-tolerance is 
therefore of paramount importance in repairing or discard- 
ing wrong examples, given enough evidence to support the 
correction. To this end, Amber uses the repeated structure 
analysis to infer missing annotations and to discard noisy 
ones, incrementally growing small seed lists of terms into 
complete gazetteers, and proving that sound and complete 
initial domain knowledge is, in the end, unnecessary. 

Learning in Amber can be carried out in two different
modes: (1) In upfront learning, Amber produces upfront
domain knowledge for a domain to bootstrap the
self-supervised wrapper generation. (2) In continuous learning,
Amber refines the domain knowledge over time, as it
extracts more pages from websites of a given domain,
adding previously unknown terms from nodes selected within the
inferred repeated structure. Regardless of the learning mode,
the core principle behind Amber's learning capabilities is
the mutual reinforcement of repeated-structure analysis and
the automatic annotation of the DOM of a page.

For the sake of explanation, a single step of the learning
process is described in Algorithm 5. To update the gazetteers
in U = {U_1, ..., U_k} from an extraction typing T and the
corresponding annotations ann, for each node n, we compare
the attribute types of n in T with the annotations ann for n.
This comparison leads to three cases:

(1) Term validation: n is an attribute node for p and
carries an annotation ann(p, n, v). Therefore, v was part of the
gazetteer for p and the repeated-structure analysis confirmed
that v is in the domain U_p of the attribute type.

(2) Term extraction: n is an attribute node for p but it
does not carry an annotation ann(p, n, v). Therefore,
Amber should consider the terms in the textual content of n for
adding to the domain U_p.

(3) Term cleaning: The node carries an annotation
ann(p, n, v) but does not correspond to an attribute node for
p in T, i.e., is noise for p. Therefore, Amber must consider
whether there is enough evidence to keep v in U_p.
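
The three cases can be separated with a few set operations. The encoding of the typing as (type, node) pairs and of the annotations as (type, node, value) triples, as well as the function name `compare`, are assumptions made for this sketch; the actual learning step additionally maintains evidence counts, as described below.

```python
def compare(typing, ann):
    """Split a learning step into validation, extraction, and cleaning.

    typing: set of (attribute type, node) pairs from the extraction typing.
    ann:    set of (attribute type, node, value) annotation triples.
    """
    annotated = {(p, n) for p, n, _ in ann}
    # (1) Term validation: annotated values confirmed by the typing.
    validated = {(p, v) for p, n, v in ann if (p, n) in typing}
    # (2) Term extraction: attribute nodes lacking an annotation.
    to_extract = {(p, n) for p, n in typing if (p, n) not in annotated}
    # (3) Term cleaning: annotated values not confirmed by the typing.
    noise = {(p, v) for p, n, v in ann if (p, n) not in typing}
    return validated, to_extract, noise


typing = {("location", "n1"), ("location", "n2"), ("price", "n3")}
ann = {
    ("location", "n1", "Oxford"),
    ("price", "n3", "2000"),
    ("price", "n4", "1500"),
}
validated, to_extract, noise = compare(typing, ann)
```

Here the location node n2 is a candidate for term extraction, while the price annotation on n4 is a candidate for term cleaning.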

For each attribute node n in the extraction typing T,
Amber applies the function components to tokenize the
textual content of the attribute node n, to remove unwanted
token types (e.g., punctuation, separator characters, etc.), and
to produce a clean set of tokens that are likely to
represent terms from the domain. For example, assume that the
textual content of a node n is the string w = "Oxford,
Walton Street, ground-floor apartment". The application of the
function components produces the set {"Oxford", "Walton
Street", "ground-floor", "apartment"} by removing the
commas from w.

Amber then iterates over all terms that are not already
known to occur in the complement $\overline{U}_p$ of the domain
of the attribute type p and decides whether it is necessary to
validate or add them to the set of known values for p. A term
v is in $\overline{U}_p$ if either it is known from the schema that
$v \in U_{p'}$ and $\Sigma \models p \wedge p' \rightarrow \bot$, or v has
been recurrently identified by the repeated-structure analysis
as noise. Each term v therefore has an associated value
$ev(v, U_p)$ (resp. $ev(v, \overline{U}_p)$) representing the evidence
of v appearing, over multiple learning steps, as a value for p
(resp. as noise for p).

If Amber determines that a node n is an attribute node
of type p but no corresponding annotation ann(p,n,v) exists,
then we add the terms of n to the domain $U_p$. Moreover, once
the term v is known to belong to $U_p$, we simply increase its
evidence by a factor $freq^{+}(v,p,T)$ that represents how
frequently v appeared as a value of p in the current extraction
typing T. The algorithm then proceeds to the reduction of
the noise in the gazetteer by checking those cases where
an annotation ann(p,n,v) is not associated with any attribute
node in the extraction typing, i.e., it is noise for p. Every
time a term v is identified as noise, we increase the value of
$ev(v, \overline{U}_p)$ by a factor $freq^{-}(v,p,T)$ that represents
how frequently the term v occurs as noise in the current
typing T. To avoid the accumulation of noise, Amber will
permanently add a term v to $\overline{U}_p$ if the evidence that v
is noisy for p is at least G times larger than the evidence
that v is a genuine value for p. The constant G is currently
set to 1.5.

Fig. 14: System architecture
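
The learning step described above can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions, not Amber's implementation: the class Gazetteer
and its method names are hypothetical, the frequency factors
are taken to be raw occurrence counts in the current typing,
and the function components is assumed to have been applied
already, so that the inputs are lists of clean terms. Only the
threshold G = 1.5 comes from the text.

```python
from collections import Counter, defaultdict

G = 1.5  # ratio quoted in the text; everything else is illustrative


class Gazetteer:
    """Evidence-tracking gazetteer for one attribute type p (hypothetical)."""

    def __init__(self, seed_terms=()):
        self.domain = set(seed_terms)       # U_p: terms accepted as values of p
        self.complement = set()             # complement of U_p: rejected terms
        self.ev_value = defaultdict(float)  # ev(v, U_p)
        self.ev_noise = defaultdict(float)  # ev(v, complement of U_p)

    def learning_step(self, attribute_terms, noise_terms):
        """Update the gazetteer from one extraction typing.

        attribute_terms: terms found in attribute nodes of type p
        noise_terms: terms annotated as p outside any attribute node of p
        """
        # Term validation (known term) and term extraction (new term):
        # assumption: freq+ is the number of occurrences in this typing.
        for term, freq in Counter(attribute_terms).items():
            if term in self.complement:
                continue  # already rejected permanently
            self.domain.add(term)
            self.ev_value[term] += freq

        # Term cleaning: accumulate noise evidence and reject a term once
        # it is at least G times the evidence for being a genuine value.
        for term, freq in Counter(noise_terms).items():
            self.ev_noise[term] += freq
            if self.ev_noise[term] >= G * self.ev_value[term]:
                self.domain.discard(term)
                self.complement.add(term)


gaz = Gazetteer(seed_terms={"Oxford", "Bath"})
gaz.learning_step(
    attribute_terms=["Oxford", "Witney", "Oxford"],
    noise_terms=["Bath", "Bath"],
)
```

In this toy run, "Witney" is learned because it occurs in an
attribute node without an annotation, while "Bath" is moved
to the complement because it only ever occurs as noise.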

To make the construction of the gazetteers even
smoother, Amber also provides a graphical facility (see
Figure 13) that enables developers to understand and possibly
drive the learning process. Amber's visual component
provides a live graphical representation of the result of the
repeated-structure analysis on individual pages and the
position of the attributes (1). Amber relates the concepts of the
domain schema (2), e.g., location and property-type, with
the discovered terms (3), also providing the corresponding
confidence value. The learning process is based on the
analysis of a selected number of pages from a list of URLs (4).
The terms that have been identified on the current page and
have been validated are added to the gazetteer (5).



5 System Architecture 



Figure 14 shows Amber's architecture, composed mainly
of three layers. The Browser Layer consists of a Java API
that abstracts the specific browser implementation actually
employed. Through this API, Amber currently supports
a real browser like Mozilla Firefox, as well as a headless
browser emulator like HTMLUnit. Amber uses the browser
to retrieve the web page to analyze, thus having direct access
to its DOM structure. This DOM tree is handed over to the
Annotator Layer, which is implemented such that different
annotators can be plugged in and used in combination,
regardless of their actual nature, e.g., web service or custom
standalone application. Given an annotation schema for the
domain at hand, this layer produces annotations on the input
DOM tree using all registered annotators. Further, the
produced annotations are reconciled w.r.t. constraints present
in the annotation schema. Currently, annotations in Amber
are performed by using a simple GATE (gate.ac.uk)
pipeline consisting of gazetteers of terms and transducers
(JAPE rules). Gazetteers for the real estate and used cars
domains are either manually collected (for the most part) or
derived from external sources such as DBpedia and Freebase.
Note that many types are common across domains (e.g., price,
location, date), and that the annotator layer allows for arbitrary
entity recognisers or annotators to be integrated.
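
To make the idea of pluggable annotators concrete, the
following Python sketch combines a gazetteer-based annotator
and a pattern-based annotator over a list of text nodes. It is
a simplified illustration, not Amber's Java API or its GATE
pipeline: the names Annotation, gazetteer_annotator,
regex_annotator, and annotate_nodes are hypothetical, DOM
nodes are reduced to indexed strings, and the reconciliation
against schema constraints is omitted.

```python
import re
from typing import Callable, Iterable, List, NamedTuple


class Annotation(NamedTuple):
    attr_type: str  # attribute type p
    node_id: int    # DOM node n, reduced here to an index
    value: str      # annotated value v


# An annotator maps (node id, node text) to zero or more annotations.
Annotator = Callable[[int, str], Iterable[Annotation]]


def gazetteer_annotator(attr_type: str, terms: List[str]) -> Annotator:
    """Annotate every whole-word occurrence of a gazetteer term."""
    def annotate(node_id: int, text: str) -> Iterable[Annotation]:
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, text, re.IGNORECASE):
                yield Annotation(attr_type, node_id, term)
    return annotate


def regex_annotator(attr_type: str, pattern: str) -> Annotator:
    """Annotate every match of a regular expression."""
    compiled = re.compile(pattern)

    def annotate(node_id: int, text: str) -> Iterable[Annotation]:
        for match in compiled.finditer(text):
            yield Annotation(attr_type, node_id, match.group(0))
    return annotate


def annotate_nodes(text_nodes: List[str],
                   annotators: List[Annotator]) -> List[Annotation]:
    """Run all registered annotators over all text nodes."""
    return [
        annotation
        for node_id, text in enumerate(text_nodes)
        for annotator in annotators
        for annotation in annotator(node_id, text)
    ]


# Illustrative text nodes of a single result record.
nodes = ["£850 pcm", "4 bedroom apartment to rent", "High Street, Witney"]
annotators = [
    gazetteer_annotator("location", ["Witney", "Oxford"]),
    regex_annotator("price", r"£\d[\d,]*"),
]
annotations = annotate_nodes(nodes, annotators)
```

Because every annotator has the same signature, a web-service
client or a standalone recogniser could be registered in the
same list without changing annotate_nodes.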



With the annotated DOM at hand, Amber can begin its
analysis with data area identification, record segmentation,
and attribute alignment. Each of these phases is a distinct
sub-module, and all of them are implemented as Datalog
rules on top of a logical representation of the DOM and its
annotations. These rules use finite domains and non-recursive
aggregation and are executed by the DLV engine.



As described in Section [3], the outcome of this analysis
is an extraction typing T along with attributes and relative
support. During Amber's bootstrapping, however, T
is in turn used as feedback to realize the learning phase
(see Section [4]), managed by the Annotation Manager module.
Here, positive and negative lists of candidate terms are kept
for each type and used to update the initial gazetteer lists.
The Annotation Manager is optionally complemented with
a graphical user interface, implemented as an Eclipse plugin
(eclipse.org), which embeds the browser for visualization.



6 Evaluation 

Amber is implemented as a three-layer analysis engine
where (1) the web access layer embeds a real browser
to access and interact with the live DOM of web pages,
(2) the annotation layer uses GATE [11] along with domain
gazetteers to produce annotations, and (3) the reasoning
layer implements the actual Amber algorithm as outlined
in Section [3] in Datalog rules over finite domains with
non-recursive aggregation.



6.1 AMBER in the UK 

We evaluate Amber on 150 UK real-estate web sites,
randomly selected among 2810 web sites named in the yellow
pages, and 100 UK used car dealer websites, randomly
selected from UK's largest used car aggregator
autotrader.co.uk. To ensure diversity in our corpus, in case
two sites use the same template, we delete one of them and
randomly choose another one. For each site, we obtain one, or
if possible, two result pages with at least two result records.
These pages form the gold standard corpus, which is manually
annotated for comparison with Amber. For the UK real
estate, the corpus contains 281 pages with 2785 records and
14614 attributes. The used car corpus contains 151 pages
with 1608 records and 12732 attributes.

For the following evaluations we use the threshold values
$\Theta_{\text{depth}} = 1$, $\Theta_{\text{dist}} = 2$,
$\Theta_{\text{infer}} = \Theta_{[?]} = 50\%$, $\Theta_{[?]} = 0\%$,
and $\Theta_{\text{keep}} = 20\%$. Figures 15a and 15b show the
overall precision and recall of Amber on the real estate and
used car corpora. As usual, precision is defined as the fraction
of recognized data areas, records, or attributes that are also
present in the gold standard, whereas recall is the fraction of
all data areas, records, and attributes in the gold standard
that are returned by Amber. Amber achieves outstanding
precision and recall on both domains (> 98%). If we measure
the average precision and recall per site (rather than the total
precision and recall), pages with fewer records have a higher
impact. But even in that harder case, precision and recall
remain above 97.5%.

Fig. 17: Amber Robustness wrt. Noise in Gazetteers
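
The difference between total and per-site scores can be made
concrete with a small Python sketch. The function names and
the toy data are illustrative assumptions, not part of the
evaluation; extracted and gold items are modelled as sets, and
every set is assumed to be non-empty.

```python
def total_scores(sites):
    """Precision and recall pooled over all sites.

    sites: list of (extracted, gold) pairs of non-empty sets.
    """
    true_positives = sum(len(ext & gold) for ext, gold in sites)
    n_extracted = sum(len(ext) for ext, _ in sites)
    n_gold = sum(len(gold) for _, gold in sites)
    return true_positives / n_extracted, true_positives / n_gold


def per_site_average(sites):
    """Precision and recall computed per site, then averaged."""
    precisions = [len(ext & gold) / len(ext) for ext, gold in sites]
    recalls = [len(ext & gold) / len(gold) for ext, gold in sites]
    return sum(precisions) / len(sites), sum(recalls) / len(sites)


# Toy data: a perfect site with four records and a small site with
# two records, one of which is extracted wrongly.
sites = [
    ({"r1", "r2", "r3", "r4"}, {"r1", "r2", "r3", "r4"}),
    ({"x", "y"}, {"x", "z"}),
]
total = total_scores(sites)        # pooled over 6 extracted / 6 gold items
average = per_site_average(sites)  # each site weighs the same
```

On this toy data the pooled precision is 5/6, whereas the
per-site average is 0.75: the single error on the small site
costs more once every site carries the same weight.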

Robustness. More importantly, Amber is very robust both
w.r.t. noise in the annotations/structure and w.r.t. the number
of repeated records per page. To give an idea, in our corpus
50% of pages contain structural noise either in the beginning
or in the final part of the data area. Also, 70% of the pages
contain noisy annotations for the price attribute, which is
used as a regular attribute in our evaluation. On average, we
count about 22 false occurrences per page. Nonetheless, Amber
is able to achieve nearly perfect accuracy, fixing noise both
from structure and annotations. Even worse, 100% of pages
contain noise for the Location (i.e., addresses/locality, no
postcode) attribute, which on average amounts to more than
50 (false positive) annotations of this type per page. To
demonstrate how Amber copes with noisy annotations, we
show in Figure 17 the correlation between the noise levels
(i.e., errors and incompleteness in the annotations) and
Amber's performance in the extraction of the location
attribute. Even when using the full list of locations, about 20%
of all annotations are missed by the annotators, yet Amber
achieves > 98% precision and recall. If we restrict the list
to 75%, 50%, and finally just 25% of the original list, the
error rate rises over 30% and 60% to 78%. Nevertheless,
Amber's accuracy remains nearly unaffected, dropping by
only 3% to about 95% (measuring here, of course, only the
accuracy of extracting location attributes). In other words,
despite only getting annotations for one out of every five
locations, Amber is able to infer the other locations from the
regular structure of the records. Amber remains robust even
if we introduce errors for more than one attribute, as long as
there is one regular attribute such as the price for which the
annotation quality is reasonable. This distinguishes Amber
from all other approaches based on automatic annotations,
which require reasonable quality (or at least, reasonable
recall). Amber achieves high performance even from very
poor quality annotators that can be created with low effort.
At the same time, Amber is very robust w.r.t. the number
of records per page. Figure 16 illustrates the distribution
of record numbers per page in our corpora. They mainly
range from 4 to 20 records per page, with peaks for 5 and
10 records. Amber performs well on both small and large
pages. Indeed, even in the case of only 3 records, it is able to
exploit the repeated structure to achieve the correct
extraction.

Distance, depth, and attribute alignment thresholds can
influence the performance of Amber. However, it is
straightforward to choose good default values for these. For
instance, considering the depth and distance thresholds,
Figure [?] shows that the pair ($\Theta_{\text{depth}} = 1$,
$\Theta_{\text{dist}} = 2$) provides significantly better
performance than (0,0) or (2,4).

Attributes. As far as attributes are concerned, there are 9
different types for the real estate domain, and 12 different
types for the used car corpus. First of all, in 96% of cases
Amber perfectly recognizes objects, i.e., properly assigns all
the attributes to the object they belong to. It mistakes one
attribute in 2% of cases, and 2 and 3 attributes only in 1% of
cases, respectively.

Fig. 19: Real Estate Attributes Evaluation



Figure 19 illustrates the precision and recall that Amber
achieves on each individual attribute type of the real
estate domain, where Amber reports nearly perfect recall
and very high precision (> 96%). The results in the used
car domain are similar (Figure [20]) except for location,
where Amber scores 91.3% precision. The reason is that,
in this particular domain, car models have a large variety of
acronyms which happen to coincide with British postcodes,
e.g., N5 is the postcode of Highbury, London, and X5 is a
model of BMW, both of which also appear with regularity on
the pages.



Figure 18 shows that on the vast majority of pages
Amber achieves near perfect accuracy. Notably, in 97% of
cases, Amber correctly retrieves between 90% and 100%
of the attributes. The percentage of cases in which Amber
identifies attributes from all attribute types is above 75%,
while only one type of attribute is wrong in 17% of the
pages. For the remaining 6% of pages Amber misidentifies
attributes from 2 or 3 types, with only one page in our
corpora on which Amber fails for 4 attribute types. This
emphasizes that on the vast majority of pages at most one or
two attribute types are problematic for Amber (usually due
to inconsistent representations or optionality).





Fig. 20: Used Car Attributes Evaluation

6.2 Large-Scale Evaluation

To demonstrate Amber's ability to deal with a large set of
diverse sites, we perform an automated experiment beyond
the sites catalogued in our gold standard. In addition to the
150 sites in our real estate gold standard, we randomly
selected another 350 sites from the 2810 sites named in the
yellow pages. On each site, we manually perform a search
until we reach the first result page and retrieve all subsequent
n result pages and the expected number of result records
on the first n - 1 pages, by manually counting the records
on the first page and assuming that the number of records
remains constant on the first n - 1 pages (on the n-th page
the number might be smaller). This yields 2215 result pages
overall with an expected number of 20723 result records.
On this dataset, Amber identifies 20172 records. Since a
manual annotation is infeasible at this scale, we compare
the frequencies of the individual types of the extracted
attributes with the frequencies of occurrences in the gold
standard, as shown in Figure [21]. Assuming that both datasets
are fairly representative selections of the whole set of
result pages from the UK real-estate domain, the frequencies
of attributes should mostly coincide, as is the case in
Figure 21. Indeed, as shown in Figure 21, price, location, and
details page deviate by less than 2%, legal status, bathroom,
and reception number by less than 5%. The high
correlation strongly suggests that the attributes are mostly
identified correctly. Postcode and property type cause
higher deviations of 18% and 12%, respectively. They are
indeed attributes that are less reliably identified by Amber,
due to the reason explained above for UK postcodes and due
to the property type often appearing only within the free text
property description.
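
A frequency comparison of this kind can be sketched as follows
in Python. The text does not define how the deviation is
computed, so the sketch assumes one plausible reading: the
frequency of an attribute type is its number of occurrences
per record, and the deviation is the absolute difference
between the two datasets. The function names and all counts
are made up for illustration.

```python
def attribute_frequencies(type_counts, n_records):
    """Occurrences per record for each attribute type."""
    return {attr: count / n_records for attr, count in type_counts.items()}


def deviations(extracted_freq, gold_freq):
    """Absolute frequency difference for every gold-standard type."""
    return {
        attr: abs(extracted_freq.get(attr, 0.0) - freq)
        for attr, freq in gold_freq.items()
    }


# Made-up counts: 100 gold-standard records, 200 extracted records.
gold = attribute_frequencies({"price": 100, "postcode": 60}, 100)
extracted = attribute_frequencies({"price": 198, "postcode": 84}, 200)
deviation = deviations(extracted, gold)
```

With these made-up counts, price deviates by about 0.01 and
postcode by about 0.18, so postcode would be flagged as the
less reliably identified attribute.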



6.3 Comparison with other Tools 

Comparison with RoadRunner. We evaluate Amber
against RoadRunner [10], a fully automatic system for
web data extraction. RoadRunner does not extract data
areas and records explicitly, therefore we only compare
the extracted attributes. RoadRunner attempts to identify
all repeated occurrences of variable data ("slots" of
the underlying page template) and therefore extracts too
many attributes. For example, RoadRunner extracts on
some pages more than 300 attributes, mostly URLs and
elements in menu structures, where our gold standard contains
only 90 actual attributes. To avoid biasing the evaluation
against RoadRunner, we filter the output of RoadRunner
by removing the description block, duplicate URLs,
and attributes not contained in the gold standard, such as
page or telephone numbers.

Another issue in comparing Amber with RoadRunner
is that RoadRunner only extracts entire text
nodes. For example, RoadRunner might extract "Price
£114,995", while Amber would produce "£114,995".
Therefore we evaluate RoadRunner in two ways, once
counting an attribute as correctly extracted if the gold
standard value is contained in one of the attributes extracted
by RoadRunner (RR ≈ in Figure [23]), and once counting
an attribute only as correctly extracted if the strings
exactly match (RR = in Figure [23]). Finally, as RoadRunner
works better with more than one result page from the
same site, we exclude sites with a single result page from
this comparison. The results are shown in Figure 23. Amber
outperforms RoadRunner by a wide margin: RoadRunner
reaches only 49% in precision and 66% in recall, compared
to almost perfect scores for Amber. As expected, recall is
higher than precision in RoadRunner.
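
The two matching criteria can be illustrated with a short
Python sketch. The function names and the sample values are
assumptions made for illustration; the sketch only shows how
containment-based and exact matching lead to different
precision and recall on the same output.

```python
def contains(extracted_value, gold_value):
    """Lenient criterion: the gold value occurs inside the extracted text."""
    return gold_value in extracted_value


def exact(extracted_value, gold_value):
    """Strict criterion: the strings must match exactly."""
    return extracted_value == gold_value


def score(extracted, gold, match):
    """Precision and recall under a given matching criterion."""
    correct = [e for e in extracted if any(match(e, g) for g in gold)]
    found = [g for g in gold if any(match(e, g) for e in extracted)]
    return len(correct) / len(extracted), len(found) / len(gold)


# Whole text nodes as an extractor might return them, including a
# menu entry, versus the gold-standard attribute values.
extracted = ["Price £114,995", "3 bedrooms", "Home"]
gold = ["£114,995", "3 bedrooms"]

lenient = score(extracted, gold, contains)
strict = score(extracted, gold, exact)
```

Under the lenient criterion two of the three extracted nodes
count as correct and both gold values are found; under the
strict criterion only "3 bedrooms" counts.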

Comparison with MDR. We further evaluate Amber against
MDR, an automatic system for mining data records in web
pages. MDR is able to recognize data areas and records,
but unlike Amber, not attributes. Therefore in our
comparison we only consider precision and recall for data areas
and records in both real estate and used cars domains. As
in the comparison with RoadRunner, we avoid biasing
the evaluation against MDR by filtering out page portions
(e.g., menu, footer, pagination links) whose regularity in
structure misleads MDR. Indeed, these are recognized by
MDR as data areas or records. Figure [23] illustrates the
results. In all cases, Amber outperforms MDR, which reports
57% in precision and 72% in recall on used cars as its best
performance. MDR suffers from the complex structure of data
records, which may contain optional information as nested
repeated structures. These, in turn, are often (wrongly)
recognized by MDR as records (data areas).



6.4 AMBER Learning 

The evaluation of Amber's learning capabilities is done
with respect to the upfront learning mode discussed in
Section [?]. In particular, we want to evaluate Amber's ability
to construct an accurate and complete gazetteer for an
attribute type from an incomplete and noisy seed gazetteer.
We show that at each learning iteration (see Algorithm [5] in
Section [4]) the accuracy of the gazetteer is significantly
improved, and that the learning process converges to a stable
gazetteer after a few iterations, even in the case of attribute
types with large and/or irregular value distributions in their
domains.

Setting. In the evaluation that follows we show Amber's
learning behaviour on the location attribute type. In our
setting, the term location refers to formal geographical
locations such as towns, counties and regions, e.g., "Oxford",
"Hampshire", and "Midlands". Also, it is often the case
that the value for an attribute type consists of multiple and
somehow structured terms, e.g., "The Old Barn, St. Thomas
Street - Oxford". The choice of location as target for the
evaluation is justified by the fact that this attribute type has
typically a very large domain consisting of ambiguous and
severely irregular terms. Even in the case of UK locations
alone, nearly all terms from the English vocabulary either
directly correspond to a location name (e.g., "Van" is a
location in Wales) or they are part of it (e.g., "Barnwood", in
Gloucestershire). The ground truth for the experiment
consists of a clean gazetteer of 2010 UK locations and 1560
different terms collected from a sample of 235 web pages
sourced from 150 different UK real-estate websites.

Table 1: Learning performance on G20.

Table 2: Learning performance on G25.



Execution. We execute the experiment on two different seed
gazetteers G20 (resp. G25) consisting of a random sample
of 402 (resp. 502) UK locations corresponding to 20%
(resp. 25%) of the ground truth.

By taking as input G20, the learning process saturates
(i.e., no new terms are learned or dropped) after six iterations
with a 92.66% accuracy (F1-score), while with G25, only 5
iterations are needed for an accuracy of 92.79%. Note that at
the first iteration the accuracy is 50.54% for G20 and 60.94%
for G25. Table [1] and Table [2] show the behaviour for each
learning round. We report the number of locations extracted
($L_E$), i.e., the number of attribute nodes carrying an
annotation of type location; among these, $C_E$ locations have
been correctly extracted, leading to a precision (resp. recall)
of the extraction of $P_E$ (resp. $R_E$). The last two columns
show the number of learned instances ($L_L$), i.e., those added
to the gazetteer and, among these, the correct ones ($C_L$).

It is easy to see that the increase in accuracy is stable
in all the learning rounds and that the process quickly
converges to a stable gazetteer.
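
The construction of the seed gazetteers and the accuracy
measure can be sketched in Python as follows. This is only
an illustration: the function names are hypothetical, the
ground truth is replaced by placeholder strings, and the
random seed is arbitrary. The sizes 2010, 402, and 502 are
the ones reported above.

```python
import random


def sample_seed(ground_truth, percent, seed=0):
    """Draw a random seed gazetteer covering `percent` of the ground truth."""
    terms = sorted(ground_truth)           # fixed order for reproducibility
    size = len(terms) * percent // 100     # integer arithmetic, rounds down
    return set(random.Random(seed).sample(terms, size))


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Placeholder ground truth standing in for the 2010 UK locations.
ground_truth = {f"location-{i}" for i in range(2010)}
g20 = sample_seed(ground_truth, 20)  # 402 terms
g25 = sample_seed(ground_truth, 25)  # 502 terms
```

A learning experiment would then repeat the learning step
starting from g20 or g25 until no term is added or dropped,
recording the F1-score of the extraction after each round.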



7 Related Work 

The key assumption in web data extraction is that a large 
fraction of the data on the web is structured [6] by HTML
markup and visual styling, especially when web pages are 
automatically generated and populated from templates and 
underlying information systems. This sets web data extrac- 
tion apart from information extraction where entities, re- 
lations, and other information are extracted from free text 
(possibly from web pages). 

Early web data extraction approaches address data
extraction via manual wrapper development [20] or through
visual, semi-automated tools [3,26] (still commonly used in
industry). Modern web data extraction approaches, on the
other hand, overwhelmingly fall into one of two categories
(for recent surveys, see [8,27]): Wrapper induction
[12,16,19,21,22,24,25,30] starts from a number of manually
annotated examples, i.e., pages where the objects and
attributes to be extracted are marked by a human, and
automatically produces a wrapper program which extracts the
corresponding content from previously unseen pages.
Unsupervised wrapper generation [10,23,29,33,34,35,36]
attempts to fully automate the extraction process by
unsupervised learning of repeated structures on the page as
they usually indicate the presence of content to be extracted.

Unfortunately, where the former are limited in
automation, the latter are in accuracy. This has caused a recent
flurry of approaches [9,13,14,32] that, like Amber, attempt to
automatise the production of examples for wrapper inducers
through existing entity recognisers or similar automatic
annotators. Where these approaches differ most is how and to
what extent they address the inevitable noise in these
automatic annotations.

7.1 Wrapper Induction Approaches 

Wrapper induction can deliver highly accurate results
provided correct and complete input annotations. The process
is based on the iterative generalization of properties (e.g.,
structural and visual) of the marked content on the input
examples. The learning algorithms infer generic and possibly
robust extraction rules in a suitable format, e.g., XPath
expressions [12,19] or automata [12,30], that are applicable to
similar pages for extracting the data they are generated from.

The structure of the required example annotations
differs across different tools, impacting the complexity of the
learned wrapper and the accuracy this wrapper achieves.
Approaches such as [16,24] operate on single attribute
annotations, i.e., annotations on a single attribute or multiple,
but a-priori unrelated attributes. As a result, the wrapper
learns the extraction rules independently for each attribute,
but, in the case of multi-attribute objects, this requires a
subsequent reconciliation phase. The approaches presented in
[21,25,30] are based on annotated trees. The advantage w.r.t.
single-attribute annotations is that tree annotations make
it easier to recognize nested structures.

By itself, wrapper induction is incapable of scaling to the
web. Because of the wide variation in the template structures
of given web sites, it is practically impossible to annotate
a sufficiently large page set to cover all relevant
combinations of features indicating the presence of structured
data. More formally, the sample complexity for web-scale
supervised wrapper induction is too high in all but some
restricted cases, as in, e.g., [35], which extracts news titles
and bodies. Furthermore, traditional wrapper inducers are very
sensitive to incompleteness and noise in the annotations, thus
requiring considerable human effort to create such low-noise
and complete annotations.



7.2 Unsupervised Web Data Extraction 

The completely unsupervised generation of wrappers has
been based on discovering regularities on pages presumably
generated by a common template. Works such as
[23,28,29,33,36,37,38] discuss domain-independent approaches
that only rely on repeated HTML markup or regularities in the
visual rendering. The most common task that can be solved
by these tools is record segmentation [28,37,38], where an
area of the page is segmented into regular blocks each
representing an object to be extracted. Unfortunately, these
systems are quite susceptible to noise in the repeated structure
as well as to regular, but irrelevant structures such as
navigation menus. This limits their accuracy severely, as also
demonstrated in Section [6]. In Amber, having domain-specific
annotators at hand, we also exploit the underlying
repeated structure of the pages, but guided by occurrences
of regular attributes which allow us to distinguish relevant
data areas from noise, as well as to address noise among the
records. This allows us to extract records with higher
precision.

A complementary line of work deals with specifically
stylized structures, such as tables [7,18] and lists [15]. The
more clearly defined characteristics of these structures
enable domain-independent algorithms that achieve fairly high
precision in distinguishing genuine structures with relevant
data from structures created only for layout purposes. They
are particularly attractive for use in settings such as web
search that optimise for coverage over all sites rather than
recall from a particular site.

Instead of limiting the structure types to be recognized,
one can exploit domain knowledge to train more specific
models. Domain-dependent approaches such as [35,39]
exploit specific properties for record detection and attribute
labeling. However, besides the difficulty of choosing the
features to be considered in the learning algorithm for each
domain, changing the domain usually results in at least a partial
retraining of the models if not an algorithmic redesign.

More recent approaches are, like Amber and the
approaches discussed in Section [7.3], domain-parametric, i.e.,
they provide a domain-independent framework which is
parameterized with a specific application domain. For
instance, [34] uses a domain ontology for data area
identification but ignores it during record segmentation.

7.3 Combined Approaches 

Besides Amber, we are only aware of three other
approaches [13,14,32] that exploit the mutual benefit of
unsupervised extraction and induction from automatic
annotations. All these approaches are a form of self-supervised
learning, a concept well known in the machine learning
community and that has already been successfully applied
in the information extraction setting [31].

In [32], web pages are independently annotated using
background knowledge from the domain and analyzed for
repeated structures with conditional random fields (CRFs).
The analysis of repeated structures identifies the record
structure in searching for evenly distributed annotations to
validate (and eventually repair) the learned structure.
Conceptually, [32] differs from Amber as it initially infers a
repeating page structure with the CRFs independently of the
annotations. Amber, in contrast, analyses only those
portions of the page that are more likely to contain useful and
regular data. Focusing the analysis of repeated structures
on smaller areas is critical for learning an accurate wrapper
since complex pages might contain several regular structures
that are not relevant for the extraction task at hand. This
is also evident from the reported accuracy of the method
proposed in [32], which ranges between 63% and 85% on
attributes and is significantly lower than Amber's accuracy.

This contrasts also with [13], which aims at making
wrapper induction robust against noisy and incomplete
annotations, such that fully automatic and cheaply generated
examples are sufficient. The underlying idea is to induce
multiple candidate wrappers by using different subsets of
the annotated input. The candidate wrappers are then ranked
according to a probabilistic model, considering both
features of the annotations and the page structure. This work
has proven that, provided that the induction algorithm
satisfies a few reasonable conditions, it is possible to produce
very accurate wrappers for single-attribute extraction, though
sometimes at the price of hundreds of calls of the wrapper
inducer. For multi-attribute extraction, [13] reports high,
if considerably lower accuracy than in the single-attribute
case. More importantly, the wrapper space is considerably
larger as the number of attributes acts as a multiplicative
factor. Unfortunately, no performance numbers for the
multi-attribute case are reported in [13]. In contrast, Amber fully
addresses the problem of multi-attribute object extraction
from noisy annotations by eliminating the annotation
errors during the attribute alignment. Moreover, Amber also
avoids any assumptions on a subsequently employed
wrapper induction system.

A more closely related work is ObjectRunner [14],
a tool driven by an intensional description of the objects to
be extracted (a SOD in the terminology of [14]). A SOD is
basically a schema for a nested relation with attribute types.
Each type comes with associated annotators (or recognizers)
for annotating the example pages to induce the actual wrap- 
per from by a variant of ExAlg Q. The SOD limits the 
wrapper space to be explored (<20 calls of the inducer) and
improves the quality of the extracted results. This is similar
to Amber, though Amber not only limits the search space,
but also considers only alternative segmentations instead of
full wrappers (see Section [3.4]). On the other hand, the SOD
can seriously limit the recall of the extraction process, in
particular, since the matching conditions of a SOD strongly
privilege precision. The approach is furthermore limited by
the rigid coupling of attribute types to separators (i.e.,
token sequences acting as boundaries between different
attribute types). In fact, attribute types appear quite frequently
together with very diverse separators (e.g., caused by a
special highlighting or by a randomly injected advertisement).
The process adopted in Amber is not only tolerant to noise
in the annotations but also to random garbage content
between attributes and between records, as is evident from the
results of our evaluation: Where ObjectRunner reports
that between 65% and 86% of the objects in 5 domains (75%
in the car domain) are extracted without any error, Amber
is able to extract over 95% of the objects from the real estate
and used car domains without any error.

8 Conclusion 

Amber pushes the state of the art in the extraction of multi-
attribute objects from the deep web through a fully
automatic approach that combines the analysis of the
repeated structure of the web page with automatically
produced annotations. Amber compensates for noise in
both the annotations and the repeated structure to achieve 
> 98% accuracy for multi-attribute object extraction. To do 
so, Amber requires a small amount of domain knowledge
that is shown (Section 4) to be easily obtainable from
just a few example instances and pages.

Though Amber outperforms existing approaches
by a notable margin for multi-attribute extraction on product
domains, there remain a number of open issues in Amber
and multi-attribute object extraction in general:

(1) Towards irregular, multi-entity domains. Do-
mains with multiple entity types have not been a focus of
data extraction systems in the past and pose a particular chal-
lenge to approaches such as Amber that are driven by do-
main knowledge. While dealing with the (frequent) case in
which these heterogeneous objects share a common regu-
lar attribute is fairly straightforward, more effort is neces-
sary when regular attributes are diverse. To this end, more
sophisticated regularity conditions may be necessary. Simi- 
larly, the ambiguity of instance annotators may be so signif- 
icant that a stronger reliance on labels in structures such as 
tables is necessary. 

(2) Holistic Data Extraction. Though data extraction 
involves several tasks, historically they have always been 
approached in isolation. Though some approaches have con- 
sidered form understanding and extraction from result pages 
together, a truly holistic approach that tries to reconcile in- 
formation from forms, result pages, details pages for indi- 
vidual objects, textual descriptions, and documents or charts 
about these objects remains an open challenge. 

(3) Whole-Domain Database. Amber, like nearly all
existing data extraction approaches, is focused on extracting 
objects from a given site. Though unsupervised approaches 
such as Amber can be applied to many sites, such a domain- 
wide extraction also requires data integration between sites 
and opens new opportunities for cross-validation between 
domains. In particular, domain-wide extraction enables
automated learning not only for instances as in Amber, but
also for new attributes through collecting sufficiently large 
sets of labels and instances to use ontology learning ap- 
proaches. 